# Reproducing *Emergent Introspective Awareness in LLMs* — End‑to‑End Notebook

This notebook runs the four experiments from the paper and recreates the **layer‑wise** plots using the harness you downloaded.

**What this notebook does**  
1. Unpacks/sets up the harness (or uses an existing `src/` in this working directory).  
2. Installs dependencies.  
3. Lets you choose a model and runtime (4‑bit/8‑bit/FP16).  
4. Runs:
   - **Injected thoughts** (layer/strength sweep)
   - **Thought vs text**
   - **Prefill intention**
   - **Intentional control**
5. Produces **layer‑wise plots** that match the paper’s figures.

**Sources from the paper** (Appendix) this notebook aligns to:
- Concept vectors taken at the *final `:` token* of `Human: Tell me about {word}\nAssistant:`, with the mean activation over 100 baseline words subtracted (pp. **36–37**).  
- Injection windows and experiment prompts (pp. **38–42**, **48–58**).  
- Layer‑wise lines and typical peaks (pp. **15**, **21**, **24**, **28–29**).
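
The concept-vector recipe in the first bullet can be sketched with synthetic data. This is a minimal illustration, not the harness's code: `concept_vector` is a hypothetical helper, and the random arrays stand in for residual-stream activations captured at the final `:` token.

```python
import numpy as np

def concept_vector(word_act: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Activation for the target word minus the mean activation over baseline words.

    word_act:      shape (d_model,), activation at the final ':' token for the target word
    baseline_acts: shape (n_baseline, d_model), same token position for the baseline words
    """
    return word_act - baseline_acts.mean(axis=0)

# Synthetic stand-ins (assumption): 100 baseline words, hidden size 8
rng = np.random.default_rng(0)
baseline_acts = rng.normal(size=(100, 8))
direction = np.array([3.0] + [0.0] * 7)
word_act = baseline_acts.mean(axis=0) + direction

vec = concept_vector(word_act, baseline_acts)
```

Subtracting the baseline mean removes whatever the prompt template contributes to every word, so `vec` here recovers `direction` up to floating-point error.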

> Note: Exact **numbers** may differ (this notebook runs open-weight HF models; the paper used Anthropic's Claude models), but you should recover the **qualitative phenomena** and **layer/strength** trends.


## 0) Get the code into this working directory

- If you have the ZIP handy (e.g., `introspection_repro_with_plots.zip`), run the cell below to unpack it.  
- If you already have a `src/` folder here, you can skip unpacking.


In [None]:
#@title Unpack the harness zip (edit ZIP_PATH if needed)
import os, zipfile, glob

# Change this to the name/path of the ZIP you uploaded to this notebook runtime
ZIP_PATH = 'introspection_repro_with_plots.zip'  # or 'introspection_repro.zip'

if os.path.isdir('src'):
    print('Found ./src — skipping unzip.')
elif os.path.exists(ZIP_PATH):
    with zipfile.ZipFile(ZIP_PATH, 'r') as z:
        z.extractall('.')
    print('Unzipped:', ZIP_PATH)
    print('Top-level entries:', os.listdir('.')[:10])
else:
    print('No ./src and no ZIP_PATH found. Please upload the ZIP to this directory or clone the repo here.')

## 1) Install dependencies

- Make sure your CUDA/Torch versions match your GPU drivers.
- If you already have Torch installed, you can skip reinstalling it.

> **Tip:** For CUDA 12.1+ wheels on Linux: `pip install torch --index-url https://download.pytorch.org/whl/cu121`


In [None]:
#@title Install Python packages (uncomment Torch line if needed)
# !pip install --upgrade pip
# If you need a matching Torch build, uncomment ONE of these lines:
# !pip install torch --index-url https://download.pytorch.org/whl/cu121   # CUDA 12.1
# !pip install torch --index-url https://download.pytorch.org/whl/cu118   # CUDA 11.8

# Install the experiment harness deps
!pip install -r requirements.txt

## 2) Wire up API keys for the **LLM grader**

The experiments use an **LLM judge** (Anthropic / OpenAI / OpenRouter) with the paper’s grading prompts (Appendix pp. 39–42, 54).  
Set any one (or more) of these in the environment for this notebook process.
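
To check which judge providers are usable before launching a run, a small helper like the one below may help. It is a sketch: `available_judges` and `KEY_VARS` are hypothetical names, and only the environment variable names come from this notebook.

```python
import os

# Environment variable names as used in this notebook
KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def available_judges(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in KEY_VARS.items() if env.get(var)]

# Example with an explicit mapping instead of the real environment
print(available_judges({"OPENAI_API_KEY": "sk-example"}))
```

Calling `available_judges()` with no argument inspects the real process environment.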


In [None]:
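#@title Load environment variables from the .env file
import os, sys

# Make ./src importable here too; the sys.path setup in section 3 runs later
if os.path.isdir("src") and os.path.abspath("src") not in sys.path:
    sys.path.insert(0, os.path.abspath("src"))

from introspect_repro import load_project_env

load_project_env()
print("Loaded variables from .env (existing environment values are preserved).")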


In [None]:
#@title Set API keys for grading (fill values or leave to inherit from your environment)
import os

# Fill in only the ones you plan to use. Leaving blank keeps the existing environment value.
ANTHROPIC = ""  # e.g., "sk-ant-..."
OPENAI = ""     # e.g., "sk-..."
OPENROUTER = "" # e.g., "sk-or-..."

if ANTHROPIC:  os.environ['ANTHROPIC_API_KEY'] = ANTHROPIC
if OPENAI:     os.environ['OPENAI_API_KEY'] = OPENAI
if OPENROUTER: os.environ['OPENROUTER_API_KEY'] = OPENROUTER

print("Anthropic:", "set" if os.environ.get("ANTHROPIC_API_KEY") else "not set")
print("OpenAI   :", "set" if os.environ.get("OPENAI_API_KEY") else "not set")
print("OpenRouter:", "set" if os.environ.get("OPENROUTER_API_KEY") else "not set")

## 3) Configure model and runtime

- **HF model**: any decoder‑only chat/instruct model with access to internals (e.g., Llama‑3‑8B‑Instruct, Qwen‑2.5‑7B/14B, Mixtral).  
- **Precision**: FP16/BF16 if the weights fit in VRAM; otherwise 8-bit or 4-bit quantization (enable at most one).  
- **Judge**: pick provider/model for grading.
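
As a rough guide for the precision choice, weight memory scales with bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: `weight_memory_gb` is a hypothetical helper, the 8e9 parameter count is a round stand-in for an "8B" model, and the estimate ignores activations, the KV cache, and quantization overhead.

```python
# Approximate bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float16": 2.0, "bfloat16": 2.0, "8bit": 1.0, "4bit": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Rough size of the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# Assumption: a model with roughly 8 billion parameters
for precision in ("float16", "8bit", "4bit"):
    print(precision, weight_memory_gb(8e9, precision))
```

Under these assumptions the weights need about 16 GB in FP16/BF16, 8 GB in 8-bit, and 4 GB in 4-bit; actual usage will be higher.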


In [None]:
#@title Choose your HF model and judge
HF_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  #@param {type:"string"}
LOAD_IN_4BIT = True                                #@param {type:"boolean"}
LOAD_IN_8BIT = False                               #@param {type:"boolean"}
DTYPE = "bfloat16"                                  #@param ["bfloat16","float16","auto"]

JUDGE_PROVIDER = "openai"                           #@param ["openai","anthropic","openrouter"]
JUDGE_MODEL = "gpt-4o-mini"                         #@param {type:"string"}

N_TRIALS_INJECTED = 30                              #@param {type:"integer"}
N_TRIALS_TVT = 50                                   #@param {type:"integer"}
N_TRIALS_PREFILL = 50                               #@param {type:"integer"}
N_TRIALS_INT_CTRL = 16                              #@param {type:"integer"}

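# The two quantization modes are mutually exclusive
if LOAD_IN_4BIT and LOAD_IN_8BIT:
    raise ValueError("Enable at most one of LOAD_IN_4BIT and LOAD_IN_8BIT.")

print("Model:", HF_MODEL)
print("Judge:", JUDGE_PROVIDER, "/", JUDGE_MODEL)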


### Add the harness to `sys.path` and compute a good default layer

We avoid loading full weights just to count layers by reading the config.
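
Turning a depth fraction into a layer index is a rounding step followed by clamping. The sketch below shows that arithmetic in isolation; `fraction_to_layer` is a hypothetical helper and the 32-layer depth is an assumption chosen for illustration.

```python
def fraction_to_layer(fraction: float, num_layers: int) -> int:
    """Map a depth fraction in [0, 1] to a zero-based layer index."""
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    index = int(round(fraction * (num_layers - 1)))
    # Clamp so out-of-range fractions still give a valid index
    return max(0, min(num_layers - 1, index))

# Assumption: a 32-layer model, for illustration only
layers = [fraction_to_layer(f, 32) for f in (0.60, 0.66, 0.70, 0.75, 0.80)]
print(layers)  # [19, 20, 22, 23, 25]
```

Because indices are zero-based, the fraction is applied to `num_layers - 1`, so `1.0` maps to the last layer and `0.0` to the first.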


In [None]:
#@title Prepare imports and compute a 2/3‑depth layer guess
import os, sys, glob, subprocess
from IPython.display import display, Image

# Add ./src to import path
SRC_CANDIDATES = ["./src", "../src", "/workspace/src"]
for c in SRC_CANDIDATES:
    if os.path.isdir(c) and os.path.abspath(c) not in sys.path:
        sys.path.insert(0, os.path.abspath(c))

# Light‑weight: use AutoConfig to count layers
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained(HF_MODEL)
NUM_LAYERS = getattr(cfg, "num_hidden_layers", None) or getattr(cfg, "n_layer", None)
LAYER_2_3 = int(round(0.66 * (NUM_LAYERS - 1))) if NUM_LAYERS else None
print("num_hidden_layers:", NUM_LAYERS, "|  ~2/3 layer:", LAYER_2_3)

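# Convenience to run a python module from this kernel.
# sys.path edits do not reach child processes, so pass ./src via PYTHONPATH too.
def run_module(modname, args_list):
    cmd = [sys.executable, "-m", modname] + list(map(str, args_list))
    print(">>>", " ".join(cmd))
    env = dict(os.environ)
    paths = [os.path.abspath(c) for c in SRC_CANDIDATES if os.path.isdir(c)]
    paths += [env["PYTHONPATH"]] if env.get("PYTHONPATH") else []
    if paths:
        env["PYTHONPATH"] = os.pathsep.join(paths)
    subprocess.run(cmd, check=True, env=env)

def latest_run_dir(exp_name):
    cands = glob.glob(os.path.join("runs", "*", exp_name))
    if not cands:
        raise FileNotFoundError(f"No runs/*/{exp_name} found; run the experiment cell first.")
    return max(cands, key=os.path.getmtime)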


## 4) Experiment A — **Injected thoughts** (layer & strength sweeps)

Protocol (Appendix pp. 36–42): compute concept vectors on the `Assistant:` colon token, subtract baseline mean (100 words), then inject on the Assistant turn of the **“injected thoughts”** prompt and judge **coherence ∧ affirmative ∧ correct identification (before saying the word)**.

In [None]:
#@title Run injected thoughts (layer/strength sweep)
LAYER_FRACTIONS = [0.60, 0.66, 0.70, 0.75, 0.80]  #@param
STRENGTHS = [1,2,4]                               #@param

args = [
    "--model", HF_MODEL,
    "--judge-provider", JUDGE_PROVIDER,
    "--judge-model", JUDGE_MODEL,
    "--n-trials", N_TRIALS_INJECTED,
    "--sweep-layers", *LAYER_FRACTIONS,
    "--strengths", *STRENGTHS
]
if LOAD_IN_4BIT: args.append("--load-in-4bit")
if LOAD_IN_8BIT: args.append("--load-in-8bit")
if DTYPE and DTYPE != "auto": args += ["--dtype", DTYPE]

run_module("introspect_repro.experiments.injected_thoughts", args)

In [None]:
#@title Plot injected thoughts: layer‑wise lines
INJECTED_RUN = latest_run_dir("injected_thoughts")
print("Run dir:", INJECTED_RUN)

# Pick one strength to match the paper’s typical layer‑wise figure (strength≈2)
PLOT_STRENGTH = 2
png = os.path.join(INJECTED_RUN, f"layerwise_strength{PLOT_STRENGTH}.png")

run_module("introspect_repro.plotting.plot_injected_thoughts",
           ["--run-dir", INJECTED_RUN, "--strength", PLOT_STRENGTH, "--save", png])

display(Image(filename=png))

## 5) Experiment B — **Thought vs Text**

Protocol (pp. 20–22; Appendix pp. 48–51): inject the concept vector for an unrelated word over the sentence tokens, then ask the model (a) **what word it is thinking about**, scored by the LLM judge (a “YES” means the injected word was identified), and (b) to **repeat the sentence exactly**, scored by string match.
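
The string-match half of the scoring can be sketched as below. This is illustrative only: `repeats_exactly` and `repeat_accuracy` are hypothetical helpers, the sentence is made up, and trimming outer whitespace before comparing is an assumption about how strict the match should be.

```python
def repeats_exactly(response: str, sentence: str) -> bool:
    """True if the response equals the sentence after trimming outer whitespace."""
    return response.strip() == sentence.strip()

def repeat_accuracy(responses, sentence):
    """Fraction of responses that reproduce the sentence exactly."""
    if not responses:
        return 0.0
    return sum(repeats_exactly(r, sentence) for r in responses) / len(responses)

sentence = "The old clock on the wall ticked loudly."
responses = [
    "The old clock on the wall ticked loudly.",     # exact
    " The old clock on the wall ticked loudly.\n",  # exact after trimming
    "The old bread on the wall ticked loudly.",     # injected word leaked into the text
    "the old clock on the wall ticked loudly.",     # case differs
]
accuracy = repeat_accuracy(responses, sentence)
```

Here two of the four responses count as exact repeats, so `accuracy` is 0.5.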

In [None]:
#@title Run thought vs text at a single layer
TVT_LAYER = LAYER_2_3 if LAYER_2_3 is not None else 12  #@param {type:"integer"}
TVT_STRENGTH = 2                                        #@param {type:"integer"}

args = [
    "--model", HF_MODEL,
    "--layer", TVT_LAYER,
    "--strength", TVT_STRENGTH,
    "--n-trials", N_TRIALS_TVT,
    "--judge-provider", JUDGE_PROVIDER,
    "--judge-model", JUDGE_MODEL
]
if LOAD_IN_4BIT: args.append("--load-in-4bit")
if LOAD_IN_8BIT: args.append("--load-in-8bit")
if DTYPE and DTYPE != "auto": args += ["--dtype", DTYPE]

run_module("introspect_repro.experiments.thought_vs_text", args)

In [None]:
#@title Plot thought vs text — layer‑wise lines (use latest run folder)
TVT_RUN = latest_run_dir("thought_vs_text")
png = os.path.join(TVT_RUN, f"tvt_layerwise_strength{TVT_STRENGTH}.png")

run_module("introspect_repro.plotting.plot_thought_vs_text",
           ["--run-dir", TVT_RUN, "--strength", TVT_STRENGTH, "--save", png])

from IPython.display import Image, display
display(Image(filename=png))

## 6) Experiment C — **Prefill intention** (apology rate)

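Protocol (pp. 22–25; Appendix pp. 53–55): prefill an **unrelated word** as the model's response and ask whether it was intended; then repeat with the concept vector for the prefilled word **retroactively injected** into the activations **preceding** the prefill. The judge labels each answer as **apology vs intended**; plot the **apology rate** vs layer. A lower apology rate means the injection made the model more likely to accept the prefilled word as intended. The effect typically peaks (the apology rate is lowest) at a layer **earlier** than ~2/3 depth.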
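
The plotted quantity is a per-layer fraction of judged apologies. A minimal sketch of that aggregation follows; `apology_rate_by_layer` is a hypothetical helper, and the `(layer, apologized)` record shape and the trial data are assumptions, not the harness's output format.

```python
from collections import defaultdict

def apology_rate_by_layer(trials):
    """Map layer -> fraction of trials in which the judge saw an apology.

    `trials` is an iterable of (layer, apologized) pairs (assumed record shape).
    """
    counts = defaultdict(lambda: [0, 0])  # layer -> [apologies, total]
    for layer, apologized in trials:
        counts[layer][0] += int(bool(apologized))
        counts[layer][1] += 1
    return {layer: a / n for layer, (a, n) in sorted(counts.items())}

# Made-up judgments for three layers
trials = [(10, True), (10, True), (17, False), (17, True), (24, True), (24, True)]
rates = apology_rate_by_layer(trials)
best_layer = min(rates, key=rates.get)  # layer with the lowest apology rate
```

With this toy data the rates are 1.0, 0.5, and 1.0 for layers 10, 17, and 24, so layer 17 is where injection most reduces apologies.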

In [None]:
#@title Run prefill intention
PREFILL_LAYER = max(0, int(round(0.55 * (NUM_LAYERS-1)))) if NUM_LAYERS else 12  #@param {type:"integer"}
PREFILL_STRENGTH = 4                                                               #@param {type:"integer"}

args = [
    "--model", HF_MODEL,
    "--layer", PREFILL_LAYER,
    "--strength", PREFILL_STRENGTH,
    "--n-trials", N_TRIALS_PREFILL,
    "--judge-provider", JUDGE_PROVIDER,
    "--judge-model", JUDGE_MODEL
]
if LOAD_IN_4BIT: args.append("--load-in-4bit")
if LOAD_IN_8BIT: args.append("--load-in-8bit")
if DTYPE and DTYPE != "auto": args += ["--dtype", DTYPE]

run_module("introspect_repro.experiments.prefill_intention", args)

In [None]:
#@title Plot prefill intention — apology rate vs layer
PREFILL_RUN = latest_run_dir("prefill_intention")
png = os.path.join(PREFILL_RUN, f"prefill_apology_layerwise_strength{PREFILL_STRENGTH}.png")

run_module("introspect_repro.plotting.plot_prefill_intention",
           ["--run-dir", PREFILL_RUN, "--strength", PREFILL_STRENGTH, "--save", png])

from IPython.display import Image, display
display(Image(filename=png))

## 7) Experiment D — **Intentional control** (think vs don’t think)

Protocol (pp. 26–29; Appendix pp. 56–58): instruct **“Think about {word}”** versus **“Don’t think about {word}”**, record residual-stream activations during generation, and plot the mean **cosine(model activations, concept vector)** vs layer for both cases.
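
The metric for one layer can be sketched as a mean of per-token cosine similarities. `mean_cosine` below is a hypothetical helper and the arrays are synthetic; it also assumes no activation row is all zeros, since that would divide by zero.

```python
import numpy as np

def mean_cosine(activations: np.ndarray, concept: np.ndarray) -> float:
    """Mean cosine similarity between each row of `activations` and `concept`.

    activations: shape (n_tokens, d_model), one layer's activations over generated tokens
    concept:     shape (d_model,)
    Assumes every row and the concept vector have non-zero norm.
    """
    dots = activations @ concept
    norms = np.linalg.norm(activations, axis=1) * np.linalg.norm(concept)
    return float(np.mean(dots / norms))

concept = np.array([1.0, 0.0])
think = np.array([[2.0, 0.0], [1.0, 1.0]])        # synthetic "think" activations
dont_think = np.array([[0.0, 3.0], [-1.0, 0.0]])  # synthetic "don't think" activations
think_score = mean_cosine(think, concept)
dont_think_score = mean_cosine(dont_think, concept)
```

Cosine similarity ignores activation magnitude, so the two instruction conditions can be compared across layers whose activation norms differ.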


In [None]:
#@title Run intentional control across several layers
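if not NUM_LAYERS:
    raise ValueError("NUM_LAYERS is unknown; set it manually before running this cell.")

LAYERS_TO_RUN = [0.50, 0.60, 0.66, 0.75, 0.85]  # as fractions of depth
# Clamp to valid indices and drop duplicates (in shallow models two fractions can round to the same layer)
layers = sorted({max(0, min(NUM_LAYERS-1, int(round(fr*(NUM_LAYERS-1))))) for fr in LAYERS_TO_RUN})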

for L in layers:
    args = [
        "--model", HF_MODEL,
        "--layer", L,
        "--n-trials", N_TRIALS_INT_CTRL
    ]
    if LOAD_IN_4BIT: args.append("--load-in-4bit")
    if LOAD_IN_8BIT: args.append("--load-in-8bit")
    if DTYPE and DTYPE != "auto": args += ["--dtype", DTYPE]

    print("\n=== Intentional control at layer", L, "===")
    run_module("introspect_repro.experiments.intentional_control", args)

In [None]:
#@title Plot intentional control — mean cosine vs layer
INTC_RUN = latest_run_dir("intentional_control")
png = os.path.join(INTC_RUN, "intent_control_layerwise.png")

run_module("introspect_repro.plotting.plot_intent_control",
           ["--run-dir", INTC_RUN, "--save", png])

from IPython.display import Image, display
display(Image(filename=png))

## 8) (Optional) Summarize injected‑thoughts metrics as a table

The “awareness” column = **coherence ∧ affirmative ∧ correct identification**, matching the Appendix grader criteria (pp. 39–42).


In [None]:
#@title Build a summary table
import pandas as pd

from introspect_repro.plotting.utils import load_results

# Output column -> key in each trial record
METRICS = {"coherence": "coherence", "affirmative": "affirmative", "correct_id": "correct_identification"}

def rate(flags):
    """Fraction of truthy entries; 0.0 for an empty list."""
    return sum(1 for x in flags if x) / len(flags) if flags else 0.0

INJ = latest_run_dir("injected_thoughts")
rows = []
for (layer, strength, j, _) in load_results(INJ):
    trials = j["trials"]
    row = dict(layer=layer, strength=strength, n=len(trials))
    for col, key in METRICS.items():
        row[col] = rate([t.get(key) for t in trials])
    # "awareness" requires all three criteria to hold on the same trial
    row["awareness"] = rate([all(t.get(key) for key in METRICS.values()) for t in trials])
    rows.append(row)
df = pd.DataFrame(rows).sort_values(["strength","layer"])
df.style.format({c:"{:.3f}" for c in [*METRICS, "awareness"]})

---

### Tips & troubleshooting

- **VRAM errors** → switch to `LOAD_IN_4BIT=True` (leaving `LOAD_IN_8BIT=False`), set `DTYPE="bfloat16"`, or pick a smaller model; reducing `N_TRIALS_*` mainly saves time and judge cost.  
- **Judge time/cost** → use OpenAI `gpt-4o-mini` or Anthropic `haiku`-class as judges for quick passes.  
- **Peaks by layer** → injected‑thoughts & thought‑vs‑text often peak near **~2/3 depth**; prefill sometimes peaks **earlier** (p. 24).  
- **High strengths** → may cause incoherence/“brain damage” (pp. 13–14); stick to {1,2,4,8}.

**Safety**: The Appendix includes a control prompt mentioning *donating to terrorist organizations*; the harness leaves that variant out by default.
