# Measuring Reasoning Depth: A Token-Level Metric for LLM Thought Quality

## What Are We Doing?

We are going to **build a metric that measures how deeply a language model reasons** — not just whether it gets the right answer, but how rich and structured its thinking process is.

Current evaluation for reasoning models is one-dimensional: accuracy. But accuracy conflates "the model knew the answer from pretraining" with "the model actually figured it out through reasoning". We want to decompose the *quality* of reasoning itself.

### The Reasoning Depth Score (RDS)

We define a composite metric that captures five dimensions of reasoning:

| Dimension | What it measures | How we detect it |
|-----------|-----------------|------------------|
| **Logical Chaining** | Sequential deduction steps | Causal connectives: "therefore", "so", "because" |
| **Branching** | Exploring multiple paths | Conditional markers: "if", "alternatively", "either" |
| **Self-Correction** | Catching and fixing errors | Revision markers: "wait", "actually", "let me reconsider" |
| **Decomposition** | Breaking into sub-problems | Structure markers: "first", "step 1", "let's break this" |
| **Verification** | Checking intermediate results | Check markers: "let me verify", "checking", "to confirm" |

The final score:

$$
\text{RDS}(y) = \sum_{d \in \mathcal{D}} w_d \cdot \frac{\text{count}_d(y)}{\text{len}(y)} \cdot \log(1 + \text{count}_d(y))
$$

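Where $\mathcal{D}$ is the set of dimensions, $w_d$ is the weight for dimension $d$, $\text{count}_d(y)$ is the number of pattern matches for $d$ in completion $y$, and $\text{len}(y)$ is the completion length in tokens. Dividing by length turns counts into densities, so concise reasoning is not penalized. The log factor adds a damped bonus for absolute count, which separates two completions of equal density but different amounts of reasoning. Note that at a fixed length the term grows as $c \log(1 + c)$ in the count $c$, so the log factor softens the reward for repeated markers but does not cap it.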
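
To make the formula concrete, here is a minimal sketch that evaluates one dimension's term for a few hand-picked counts and lengths. The helper name `rds_term` and all the numbers are illustrative assumptions, not part of the notebook's implementation.

```python
import math


def rds_term(count: int, length: int, weight: float = 1.0) -> float:
    """One dimension's contribution: w_d * (count / len) * log(1 + count)."""
    return weight * (count / max(length, 1)) * math.log1p(count)


# Same three markers in a 50-token and a 500-token completion:
# the term scales inversely with length.
short_term = rds_term(3, 50)
long_term = rds_term(3, 500)
print(f"3 markers / 50 tokens:  {short_term:.4f}")
print(f"3 markers / 500 tokens: {long_term:.4f}")

# At a fixed length of 100 tokens, how much does each extra marker add?
increments = [rds_term(c, 100) - rds_term(c - 1, 100) for c in range(1, 6)]
print("Per-marker increments:", [round(x, 4) for x in increments])
```

The first comparison shows the length normalization at work. The second shows that, with length held fixed, each additional marker still adds a little more than the previous one.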

### Why This Matters

1. **Evaluating RL training** — Does GRPO/PPO actually increase reasoning depth, or just accuracy?
2. **Model comparison** — Which models reason deeply vs. which ones pattern-match?
3. **Steering analysis** — Does activation steering (like our earlier CAA experiments) change reasoning depth?
4. **Interpretability** — What does "thinking harder" look like at the token level?

### Research Questions

1. Does reasoning depth correlate with accuracy? (Is deeper = better?)
2. Do larger models reason deeper, or just wider?
3. Does the model exhibit different depth profiles on easy vs. hard problems?
4. Can we detect reasoning collapse (model gives up and guesses) from depth signals?
5. Do different model families (Qwen3 vs. Gemma 3) have distinct reasoning "styles"?
6. Does Qwen3.5's MoE architecture (3B active out of 35B total) change reasoning patterns compared to dense models?

**Models tested:**
- `Qwen/Qwen3-0.6B` — Tiny dense, thinking/non-thinking modes
- `Qwen/Qwen3-1.7B` — Mid-range dense, same family
- `google/gemma-3-1b-it` — Different architecture, similar size
- `Qwen/Qwen3.5-35B-A3B` (optional) — MoE with only 3B active params, latest Qwen generation

**Dataset:** GSM8K test split (first 150 problems)  
**Hardware:** T4/L4/A100 (inference only, no training needed)

## Why Not Just Use Accuracy?

Consider two solutions to the same problem:

**Model A:** "The answer is 42."  
**Model B:** "Let me break this down. First, we calculate the base cost: 7 × 5 = 35. Then, we add the tax: 35 × 0.2 = 7. Therefore, the total is 35 + 7 = 42."

Both get accuracy = 1.0. But Model B's output is fundamentally more useful — it's verifiable, interpretable, and demonstrates genuine mathematical reasoning. If we can measure this difference quantitatively, we unlock:

- Better reward functions for RL training (reward depth, not just correctness)
- Better model selection (choose the model that reasons, not just memorizes)
- Better understanding of what happens during training (does SFT learn reasoning or lookup?)

## Step 1: Setup

Pure inference experiment — no training. We load multiple small models and generate solutions, then measure reasoning depth across all of them.

**Note:** Qwen3 requires `transformers>=4.51.0`. The optional Qwen3.5-35B-A3B (MoE) requires installing transformers from the main branch.

In [None]:
!pip install -q "transformers>=4.51.0" accelerate datasets torch matplotlib seaborn numpy scipy bitsandbytes
!pip install -q huggingface_hub

In [None]:
import torch
import re
import gc
import json
import numpy as np
from collections import defaultdict
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 120

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Step 2: Define the Reasoning Depth Metric

The metric is built from **pattern detectors** — each detector looks for textual signals of a specific reasoning behavior. This is deliberately simple and interpretable. We're not training a classifier; we're building a rule-based decomposition that can be inspected and debugged.
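
As a small illustration of what a single pattern detector does, the sketch below counts one marker with and without word boundaries. The example sentence is made up for this purpose.

```python
import re

# Word boundaries (\b) keep short markers from firing inside longer words.
text = "Also, the cost is absolute. So the total is 12, therefore we stop."
lowered = text.lower()

with_boundary = re.findall(r"\bso\b", lowered)
without_boundary = re.findall(r"so", lowered)

print(f"matches with \\b:    {len(with_boundary)}")
print(f"matches without \\b: {len(without_boundary)}")
```

Without the boundaries, "also" and "absolute" would each be counted as a causal connective, which is why every pattern in the metric is anchored with `\b`.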

### Design Choices

**Why text patterns instead of probing hidden states?**  
Hidden state probes (like our earlier SAE work) are model-specific. A reasoning depth metric should work across any model — including closed-source ones where we only see the text output. Text patterns are universal, interpretable, and reproducible.

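**Why normalize by length?**  
Without normalization, longer outputs tend to score higher simply because they have more room for marker words. A model that rambles for 500 tokens and uses four connectives along the way shouldn't outscore a model that writes 50 tokens containing three logical steps. We normalize by token count to measure the *density* of reasoning, not its volume.

**Why log-scaled counts?**  
Density alone cannot distinguish a 10-token answer with one "therefore" from a 100-token solution with ten. Multiplying by $\log(1 + \text{count})$ adds a bonus for the absolute amount of reasoning, and the log keeps that bonus from growing as fast as the count itself: the 15th "therefore" raises the multiplier far less than the 2nd. At a fixed length the product of density and log-count still increases with every extra marker, so the log term softens, but does not cap, the reward for repetition.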

In [None]:
# ============================================================
# Reasoning Depth Metric — Core Implementation
# ============================================================

REASONING_DIMENSIONS = {
    "logical_chaining": {
        "description": "Sequential deduction — building conclusions from premises",
        "weight": 1.0,
        "patterns": [
            r"\btherefore\b",
            r"\bso\b(?!\s+(?:much|many|far|long)\b)",  # "so" as a connective; skips "so much", "so many", "so far", "so long"
            r"\bbecause\b",
            r"\bthus\b",
            r"\bhence\b",
            r"\bwhich\s+means\b",
            r"\bthis\s+gives\b",
            r"\bwe\s+get\b",
            r"\bwe\s+have\b",
            r"\bimplies\b",
        ],
    },
    "branching": {
        "description": "Exploring multiple paths or cases",
        "weight": 1.2,  # Branching is harder, weighted slightly more
        "patterns": [
            r"\bif\b",
            r"\balternatively\b",
            r"\beither\b",
            r"\bcase\s+\d",
            r"\bon\s+the\s+other\s+hand\b",
            r"\bsuppose\b",
            r"\bconsider\b",
            r"\bwhat\s+if\b",
        ],
    },
    "self_correction": {
        "description": "Catching and fixing errors mid-stream",
        "weight": 1.5,  # Self-correction is rare and valuable
        "patterns": [
            r"\bwait\b",
            r"\bactually\b",
            r"\blet\s+me\s+reconsider\b",
            r"\bthat'?s\s+(not\s+right|wrong|incorrect)\b",
            r"\bi\s+made\s+a\s+mistake\b",  # lowercase "i": text is lowercased before matching
            r"\bcorrection\b",
            r"\bno,\s",
            r"\bhmm\b",
            r"\blet\s+me\s+redo\b",
        ],
    },
    "decomposition": {
        "description": "Breaking problem into sub-parts",
        "weight": 1.0,
        "patterns": [
            r"\bfirst\b",
            r"\bsecond\b",
            r"\bthird\b",
            r"\bthen\b",
            r"\bnext\b",
            r"\bfinally\b",
            r"\bstep\s+\d",
            r"\blet'?s\s+break\b",
            r"\bpart\s+\d",
            r"\bwe\s+need\s+to\b",
        ],
    },
    "verification": {
        "description": "Checking intermediate or final results",
        "weight": 1.3,  # Verification indicates metacognition
        "patterns": [
            r"\blet(?:'?s|\s+me|\s+us)?\s+(?:check|verify|confirm)\b",  # "let's check", "let me verify"
            r"\bchecking\b",
            r"\bto\s+confirm\b",
            r"\bto\s+verify\b",
            r"\bdoes\s+this\s+make\s+sense\b",
            r"\bsanity\s+check\b",
            r"\bindeed\b",
            r"\bwe\s+can\s+confirm\b",
        ],
    },
}


def compute_dimension_score(text: str, dimension: dict) -> dict:
    """Compute the reasoning score for a single dimension."""
    text_lower = text.lower()
    total_matches = 0
    pattern_hits = {}

    for pattern in dimension["patterns"]:
        matches = re.findall(pattern, text_lower)
        count = len(matches)
        if count > 0:
            pattern_hits[pattern] = count
        total_matches += count

    return {
        "raw_count": total_matches,
        "pattern_hits": pattern_hits,
    }


def compute_reasoning_depth(text: str, tokenizer=None) -> dict:
    """Compute the full Reasoning Depth Score (RDS) for a completion.

    Returns:
        dict with overall RDS, per-dimension scores, and diagnostics.
    """
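    # Token count for normalization
    if tokenizer is not None:
        # Exclude BOS/EOS so special tokens don't inflate the length
        token_count = len(tokenizer.encode(text, add_special_tokens=False))
    else:
        token_count = len(text.split())  # Fallback: whitespace word count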

    token_count = max(token_count, 1)  # Avoid division by zero

    dimension_scores = {}
    total_rds = 0.0

    for dim_name, dim_config in REASONING_DIMENSIONS.items():
        result = compute_dimension_score(text, dim_config)
        count = result["raw_count"]
        weight = dim_config["weight"]

        # RDS formula: weight × (count / length) × log(1 + count)
        score = weight * (count / token_count) * np.log1p(count)

        dimension_scores[dim_name] = {
            "score": score,
            "raw_count": count,
            "density": count / token_count,
            "weight": weight,
            "pattern_hits": result["pattern_hits"],
        }
        total_rds += score

    return {
        "rds": total_rds,
        "dimensions": dimension_scores,
        "token_count": token_count,
    }


# Quick sanity test
shallow = "The answer is 42."
deep = (
    "Let me break this down step by step. First, we calculate the base: 7 × 5 = 35. "
    "Then, we need to add tax: 35 × 0.2 = 7. Therefore, the total is 35 + 7 = 42. "
    "Let me verify: 42 - 7 = 35, and 35 / 5 = 7. Indeed, this is correct."
)

shallow_score = compute_reasoning_depth(shallow)
deep_score = compute_reasoning_depth(deep)

print(f"Shallow reasoning RDS: {shallow_score['rds']:.4f}")
print(f"Deep reasoning RDS:    {deep_score['rds']:.4f}")
print(f"\nDeep reasoning breakdown:")
for dim, info in deep_score["dimensions"].items():
    if info["raw_count"] > 0:
        print(f"  {dim}: score={info['score']:.4f}, count={info['raw_count']}, density={info['density']:.3f}")
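
Because `compute_dimension_score` lowercases the text before matching, every pattern has to be written in lowercase as well. The following sketch, using a made-up sentence, shows what happens when a pattern contains an uppercase literal, along with one possible alternative (`re.IGNORECASE` on the original text).

```python
import re

text = "Wait, I made a mistake in the second step."
lowered = text.lower()

# An uppercase literal can never match text that was lowercased first.
uppercase_pattern = r"\bI\s+made\s+a\s+mistake\b"
lowercase_pattern = r"\bi\s+made\s+a\s+mistake\b"

hits_upper = len(re.findall(uppercase_pattern, lowered))
hits_lower = len(re.findall(lowercase_pattern, lowered))

# Alternative: skip lowercasing and match case-insensitively instead.
hits_ignorecase = len(re.findall(uppercase_pattern, text, flags=re.IGNORECASE))

print(hits_upper, hits_lower, hits_ignorecase)
```

Either convention works as long as it is applied consistently across all patterns.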

## Step 3: Load Dataset and Models

We evaluate on GSM8K because:
1. It has known ground-truth answers (so we can correlate depth with correctness).
2. Problems have natural difficulty variation (1-step vs 5-step).
3. It's a standard, widely reported benchmark, so results can be compared with other work.

We test three core models to compare reasoning styles across architectures, sizes, and model generations:

- **Qwen3-0.6B** — Smallest Qwen3, supports thinking/non-thinking modes. We use `/no_think` to measure pure non-thinking reasoning depth. Sets the floor.
- **Qwen3-1.7B** — 3× larger, same family and generation. Does size = depth within the same architecture?
- **google/gemma-3-1b-it** — Different architecture (Google vs Alibaba), similar size to Qwen3-1.7B. Do different pretraining approaches produce different reasoning styles?

Plus an optional stretch model:
- **Qwen3.5-35B-A3B** — A Mixture-of-Experts model with 35B total params but only **3B active per token**. This is the latest Qwen generation (Feb 2026). It's feasible on A100 with 4-bit quantization. The question: does MoE + latest pretraining data produce qualitatively different reasoning patterns?

We generate on 150 problems: enough to detect moderate differences between models while staying fast enough to iterate.

In [None]:
from huggingface_hub import login

login(token="YOUR-TOKEN")  # Needed for Gemma

In [None]:
# Load GSM8K test set
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
print(f"GSM8K test: {len(gsm8k)} problems")

N_EVAL = 150  # Problems to evaluate per model


def extract_gsm8k_answer(answer_text: str) -> str:
    match = re.search(r"####\s*([\d,\.\-]+)", answer_text)
    if match:
        return match.group(1).replace(",", "").strip()
    return ""


# Estimate problem difficulty by counting steps in the reference solution
def estimate_difficulty(answer_text: str) -> str:
    """Categorize problem difficulty by number of computation steps."""
    # Count lines with calculations in the reference answer
    calc_lines = len(re.findall(r"<<.+?>>", answer_text))
    if calc_lines <= 2:
        return "easy"
    elif calc_lines <= 4:
        return "medium"
    else:
        return "hard"


# Prepare evaluation data with difficulty labels
eval_data = []
for i in range(N_EVAL):
    eval_data.append({
        "question": gsm8k[i]["question"],
        "ground_truth": extract_gsm8k_answer(gsm8k[i]["answer"]),
        "difficulty": estimate_difficulty(gsm8k[i]["answer"]),
        "ref_answer": gsm8k[i]["answer"],
    })

# Difficulty distribution
diff_counts = defaultdict(int)
for d in eval_data:
    diff_counts[d["difficulty"]] += 1
print(f"\nDifficulty distribution: {dict(diff_counts)}")
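
To see what the two regexes above operate on, here is a minimal sketch applied to a GSM8K-style reference answer. Such answers mark each calculation with a `<<...>>` annotation and put the final answer after `####`. The sample text is included inline so the snippet runs without downloading the dataset.

```python
import re

# A GSM8K-style reference answer: calculator annotations in <<...>>,
# final answer after "####".
ref_answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

# Same patterns as extract_gsm8k_answer and estimate_difficulty.
final = re.search(r"####\s*([\d,\.\-]+)", ref_answer).group(1).replace(",", "")
n_steps = len(re.findall(r"<<.+?>>", ref_answer))
label = "easy" if n_steps <= 2 else "medium" if n_steps <= 4 else "hard"

print(f"final answer: {final}, calculation steps: {n_steps}, difficulty: {label}")
```

Counting annotations is only a proxy for difficulty: it measures how many arithmetic steps the reference solution uses, not how hard each step is.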

In [None]:
# ============================================================
# Model Configuration
# ============================================================
# Core models (run on T4/L4/A100)
MODELS = [
    "Qwen/Qwen3-0.6B",
    "Qwen/Qwen3-1.7B",
    "google/gemma-3-1b-it",
]

# Optional: Qwen3.5 MoE (needs A100 with 4-bit quantization)
# Uncomment to include — adds ~15 min to evaluation
# MODELS.append("Qwen/Qwen3.5-35B-A3B")

# Models that support Qwen3's thinking mode — we disable it with /no_think
# to measure the model's "raw" reasoning without built-in CoT scaffolding
QWEN3_MODELS = {"Qwen3-0.6B", "Qwen3-1.7B", "Qwen3.5-35B-A3B"}

SYSTEM_PROMPT = (
    "Solve the following math problem step by step. "
    "Show your work clearly and put your final answer in \\boxed{}."
)


def extract_boxed_answer(text: str) -> str:
    match = re.search(r"\\boxed\{([^{}]+)\}", text)
    if match:
        return match.group(1).strip().replace(",", "").replace("$", "").strip(".")
    return ""


def normalize_number(s: str) -> str:
    s = s.strip().replace(",", "").replace(" ", "")
    try:
        val = float(s)
        if val == int(val):
            return str(int(val))
        return str(val)
    except (ValueError, OverflowError):  # OverflowError: int(float("inf"))
        return s


def needs_quantization(model_id: str) -> bool:
    """Check if a model needs 4-bit quantization to fit in VRAM."""
    return "35B" in model_id or "27B" in model_id


def is_qwen3(model_name: str) -> bool:
    """Check if model is Qwen3 family (supports /no_think)."""
    return any(q in model_name for q in QWEN3_MODELS)
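
The answer-matching helpers are easiest to understand through a few inputs. The sketch below re-implements them in compact form under the hypothetical names `extract_boxed` and `normalize`, so it runs on its own. The sample strings are invented.

```python
import re


def extract_boxed(text: str) -> str:
    r"""Return the contents of the first \boxed{...} that has no nested braces."""
    match = re.search(r"\\boxed\{([^{}]+)\}", text)
    return match.group(1).strip() if match else ""


def normalize(s: str) -> str:
    """Canonicalize a numeric string; leave non-numeric strings unchanged."""
    s = s.strip().replace(",", "").replace(" ", "")
    try:
        val = float(s)
        return str(int(val)) if val == int(val) else str(val)
    except (ValueError, OverflowError):
        return s


print(normalize("42.0"), normalize("1,234"), normalize("3.50"))
print(repr(extract_boxed(r"The total is \boxed{1,234}.")))
print(repr(extract_boxed(r"The ratio is \boxed{\frac{1}{2}}.")))
```

The last call returns an empty string because the pattern rejects nested braces. That is acceptable for GSM8K, where answers are plain numbers, but would undercount accuracy on datasets with fractional or symbolic answers.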

## Step 4: Generate Solutions and Compute Depth

For each model, we:
1. Load it in bfloat16 (or 4-bit for the MoE model) for VRAM efficiency.
2. Generate solutions for all 150 problems with greedy decoding.
3. For Qwen3 models, we append `/no_think` to disable the built-in thinking mode — we want to measure the model's "raw" reasoning without the `<think>` scaffolding.
4. Check correctness against ground truth.
5. Compute the full RDS breakdown for each solution.
6. Unload the model to free VRAM for the next one.

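We process models sequentially and unload each one before loading the next, so peak VRAM never exceeds a single model's footprint. This matters mainly for the optional 35B MoE model on Colab GPUs.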

In [None]:
all_results = {}  # model_name → list of per-problem results

for model_id in MODELS:
    model_name = model_id.split("/")[-1]
    print(f"\n{'=' * 60}")
    print(f"Evaluating: {model_name}")
    print(f"{'=' * 60}")

    # Load model — use 4-bit quantization for large MoE models
    load_kwargs = {
        "torch_dtype": torch.bfloat16,
        "device_map": "auto",
    }
    if needs_quantization(model_id):
        from transformers import BitsAndBytesConfig
        load_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        print(f"  Loading with 4-bit quantization (MoE model)")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    model.eval()

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Check if this model needs /no_think suffix
    use_no_think = is_qwen3(model_name)
    if use_no_think:
        print(f"  Qwen3 detected — appending /no_think to prompts")

    results = []

    for i, item in enumerate(eval_data):
        # Build prompt — add /no_think for Qwen3 models
        question_text = item["question"]
        if use_no_think:
            question_text += " /no_think"

        messages = [
            {"role": "user", "content": f"{SYSTEM_PROMPT}\n\n{question_text}"},
        ]

        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,
            )

        input_len = inputs["input_ids"].shape[1]
        completion = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)

        # Strip any <think>...</think> tags that might leak through
        completion = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()

        # Check correctness
        predicted = extract_boxed_answer(completion)
        is_correct = (
            predicted != ""
            and normalize_number(predicted) == normalize_number(item["ground_truth"])
        )

        # Compute reasoning depth
        depth = compute_reasoning_depth(completion, tokenizer)

        results.append({
            "question": item["question"],
            "ground_truth": item["ground_truth"],
            "predicted": predicted,
            "correct": is_correct,
            "difficulty": item["difficulty"],
            "completion": completion,
            "rds": depth["rds"],
            "dimensions": depth["dimensions"],
            "token_count": depth["token_count"],
        })

        if (i + 1) % 30 == 0:
            acc = sum(1 for r in results if r["correct"]) / len(results)
            avg_rds = np.mean([r["rds"] for r in results])
            print(f"  [{i+1}/{N_EVAL}] acc={acc:.1%}, avg_RDS={avg_rds:.4f}")

    all_results[model_name] = results

    # Summary
    acc = sum(1 for r in results if r["correct"]) / len(results)
    avg_rds = np.mean([r["rds"] for r in results])
    print(f"\n  Final: accuracy={acc:.1%}, mean_RDS={avg_rds:.4f}")

    # Free VRAM
    del model, tokenizer
    gc.collect()
    torch.cuda.empty_cache()

print(f"\nAll models evaluated.")
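
The regex in the loop above removes only closed `<think>...</think>` blocks. If a generation were ever cut off by `max_new_tokens` inside a thinking block, the opening tag would be left dangling and the partial thoughts would be scored as part of the answer. A minimal sketch of a stricter cleanup, under the hypothetical name `strip_think`, might look like this:

```python
import re


def strip_think(completion: str) -> str:
    """Remove <think> blocks, including one left unclosed by truncation."""
    # Closed blocks: <think> ... </think>
    text = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    # Unclosed block: drop everything from a dangling <think> to the end.
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()


print(repr(strip_think("<think>\n\n</think>\n\nThe answer is \\boxed{5}.")))
print(repr(strip_think("Start <think>unfinished thought")))
```

Whether dropping the truncated block is the right call depends on the question being asked. For measuring non-thinking output it keeps the two modes cleanly separated.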

## Step 5: Results Dashboard

Let's visualize everything. We want to answer:

1. **Global comparison** — How do models rank on accuracy vs. reasoning depth?
2. **Depth-accuracy correlation** — Does deeper reasoning lead to more correct answers?
3. **Dimension profiles** — Do models have different reasoning "signatures"?
4. **Difficulty scaling** — Does depth increase with problem difficulty?

In [None]:
# ============================================================
# Figure 1: Model Summary Table
# ============================================================

print(f"{'=' * 80}")
print(f"{'Model':<30} {'Accuracy':>10} {'Mean RDS':>10} {'Med RDS':>10} {'Avg Tokens':>12}")
print(f"{'-' * 80}")

summary = {}
for model_name, results in all_results.items():
    acc = sum(1 for r in results if r["correct"]) / len(results)
    rds_vals = [r["rds"] for r in results]
    tok_vals = [r["token_count"] for r in results]

    summary[model_name] = {
        "accuracy": acc,
        "mean_rds": np.mean(rds_vals),
        "median_rds": np.median(rds_vals),
        "std_rds": np.std(rds_vals),
        "mean_tokens": np.mean(tok_vals),
    }

    print(
        f"{model_name:<30} {acc:>9.1%} {np.mean(rds_vals):>10.4f}"
        f" {np.median(rds_vals):>10.4f} {np.mean(tok_vals):>11.0f}"
    )

print(f"{'=' * 80}")
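
Mean RDS values for different models are only comparable if their uncertainty is known. One simple option, offered here as a sketch and not as part of the original analysis, is a percentile bootstrap over problems. The function name `bootstrap_mean_ci` and the toy scores are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical RDS values standing in for one model's per-problem scores.
toy_rds = [0.02, 0.05, 0.11, 0.08, 0.00, 0.14, 0.07, 0.09, 0.03, 0.12]
low, high = bootstrap_mean_ci(toy_rds)
print(f"mean={statistics.fmean(toy_rds):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

In the notebook, `toy_rds` would be replaced by `[r["rds"] for r in results]` for each model. Overlapping intervals suggest that a difference in mean RDS may not be reliable at this sample size.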

In [None]:
# ============================================================
# Figure 2: RDS Distribution, Correct vs Wrong (Stacked Histogram)
# ============================================================

n_models = len(all_results)
fig, axes = plt.subplots(1, n_models, figsize=(6.5 * n_models, 6))
if n_models == 1:
    axes = [axes]
colors = ["#e74c3c", "#3498db", "#2ecc71", "#9b59b6"]  # 4th color for optional Qwen3.5

for idx, (model_name, results) in enumerate(all_results.items()):
    ax = axes[idx]

    correct_rds = [r["rds"] for r in results if r["correct"]]
    wrong_rds = [r["rds"] for r in results if not r["correct"]]

    ax.hist(
        [correct_rds, wrong_rds],
        bins=25,
        label=["Correct", "Wrong"],
        color=["#2ecc71", "#e74c3c"],
        alpha=0.7,
        stacked=True,
    )

    ax.set_title(f"{model_name}", fontsize=13)
    ax.set_xlabel("Reasoning Depth Score")
    ax.set_ylabel("Count")
    ax.legend()

    # One-sided Mann-Whitney U test: is RDS higher for correct than for wrong answers?
    if correct_rds and wrong_rds:
        u_stat, p_val = stats.mannwhitneyu(
            correct_rds, wrong_rds, alternative="greater"
        )
        ax.text(
            0.95, 0.95,
            f"p={p_val:.3f}",
            transform=ax.transAxes,
            ha="right", va="top",
            fontsize=10,
            bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5),
        )

plt.suptitle(
    "RDS Distribution: Correct vs Wrong Answers",
    fontsize=15, y=1.02,
)
plt.tight_layout()
plt.savefig("rds_correct_vs_wrong.png", dpi=150, bbox_inches="tight")
plt.show()
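
The p-value from the Mann-Whitney test says whether the two groups differ, not by how much. A natural companion is the probability that a randomly chosen correct solution has a higher RDS than a randomly chosen wrong one, which should equal the U statistic for the first sample divided by `n1 * n2`. The sketch below computes it directly from pairs. The function name `prob_superiority` and the toy scores are illustrative.

```python
def prob_superiority(correct_scores, wrong_scores):
    """P(score of a correct answer > score of a wrong one); ties count 0.5."""
    wins = 0.0
    for c in correct_scores:
        for w in wrong_scores:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(correct_scores) * len(wrong_scores))


# Toy scores: every correct solution is deeper than every wrong one.
separated = prob_superiority([0.3, 0.4, 0.5], [0.1, 0.2])
# Toy scores: identical distributions.
identical = prob_superiority([0.1, 0.2], [0.1, 0.2])
print(separated, identical)
```

A value near 0.5 means RDS does not separate correct from wrong answers, while values approaching 1.0 mean correct answers are almost always deeper.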

In [None]:
# ============================================================
# Figure 3: Reasoning Dimension Radar Chart
# ============================================================

dim_names = list(REASONING_DIMENSIONS.keys())
n_dims = len(dim_names)
colors = ["#e74c3c", "#3498db", "#2ecc71", "#9b59b6"]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

angles = np.linspace(0, 2 * np.pi, n_dims, endpoint=False).tolist()
angles += angles[:1]  # Close the polygon

for idx, (model_name, results) in enumerate(all_results.items()):
    # Average score per dimension across all problems
    dim_avgs = []
    for dim in dim_names:
        scores = [r["dimensions"][dim]["score"] for r in results]
        dim_avgs.append(np.mean(scores))

    values = dim_avgs + [dim_avgs[0]]  # Close the polygon

    ax.plot(angles, values, "o-", linewidth=2, label=model_name, color=colors[idx % len(colors)])
    ax.fill(angles, values, alpha=0.15, color=colors[idx % len(colors)])

ax.set_xticks(angles[:-1])
ax.set_xticklabels(
    [d.replace("_", "\n") for d in dim_names],
    fontsize=11,
)
ax.set_title("Reasoning Dimension Profiles", fontsize=14, pad=20)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))

plt.tight_layout()
plt.savefig("reasoning_radar.png", dpi=150, bbox_inches="tight")
plt.show()

In [None]:
# ============================================================
# Figure 4: Reasoning Depth vs Problem Difficulty
# ============================================================

n_models = len(all_results)
fig, axes = plt.subplots(1, n_models, figsize=(6 * n_models, 5))
if n_models == 1:
    axes = [axes]
difficulty_order = ["easy", "medium", "hard"]

for idx, (model_name, results) in enumerate(all_results.items()):
    ax = axes[idx]

    # Group RDS by difficulty
    diff_rds = defaultdict(list)
    diff_acc = defaultdict(list)
    for r in results:
        diff_rds[r["difficulty"]].append(r["rds"])
        diff_acc[r["difficulty"]].append(1 if r["correct"] else 0)

    positions = range(len(difficulty_order))
    box_data = [diff_rds.get(d, [0]) for d in difficulty_order]

    bp = ax.boxplot(
        box_data,
        positions=list(positions),
        widths=0.6,
        patch_artist=True,
    )
    for patch, color in zip(bp["boxes"], ["#a8e6cf", "#ffd3b6", "#ffaaa5"]):
        patch.set_facecolor(color)

    ax.set_xticks(list(positions))
    ax.set_xticklabels(difficulty_order, fontsize=11)
    ax.set_title(f"{model_name}", fontsize=13)
    ax.set_ylabel("Reasoning Depth Score")
    ax.set_xlabel("Problem Difficulty")

    # Add accuracy as text labels
    for j, d in enumerate(difficulty_order):
        acc_vals = diff_acc.get(d, [0])
        acc = np.mean(acc_vals) if acc_vals else 0
        ax.text(
            j, ax.get_ylim()[1] * 0.95,
            f"acc={acc:.0%}",
            ha="center", fontsize=9,
            bbox=dict(boxstyle="round", facecolor="white", alpha=0.8),
        )

plt.suptitle(
    "Reasoning Depth by Problem Difficulty",
    fontsize=15, y=1.02,
)
plt.tight_layout()
plt.savefig("rds_by_difficulty.png", dpi=150, bbox_inches="tight")
plt.show()

## Step 6: Correlation Analysis — Does Depth Actually Help?

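The key research question: is reasoning depth associated with correctness in its own right, or is it just a proxy for output length? Correlations on generated outputs cannot establish causation, but they can tell us whether the metric carries signal beyond verbosity.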

We compute:
1. **Point-biserial correlation** between RDS and correctness (the Pearson correlation for a binary variable).
2. **Length controls**: the correlation of output length with correctness and with RDS, which shows how much of the depth signal could be explained by verbosity alone.
3. **Per-dimension correlation** — which reasoning behaviors matter most for accuracy?

In [None]:
print(f"{'=' * 70}")
print(f"CORRELATION ANALYSIS: Reasoning Depth vs Accuracy")
print(f"{'=' * 70}")

for model_name, results in all_results.items():
    print(f"\n--- {model_name} ---")

    rds_vals = np.array([r["rds"] for r in results])
    correct_vals = np.array([1.0 if r["correct"] else 0.0 for r in results])
    token_vals = np.array([r["token_count"] for r in results])

    # Point-biserial correlation (RDS vs correctness)
    r_rds, p_rds = stats.pointbiserialr(correct_vals, rds_vals)
    print(f"  RDS ↔ Correctness:     r={r_rds:.3f}, p={p_rds:.4f}")

    # Length vs correctness (is it just verbosity?)
    r_len, p_len = stats.pointbiserialr(correct_vals, token_vals)
    print(f"  Length ↔ Correctness:  r={r_len:.3f}, p={p_len:.4f}")

    # RDS vs Length (how correlated are depth and verbosity?)
    r_rl, p_rl = stats.pearsonr(rds_vals, token_vals)
    print(f"  RDS ↔ Length:          r={r_rl:.3f}, p={p_rl:.4f}")

    # Per-dimension correlation with correctness
    print(f"\n  Per-dimension correlations:")
    for dim in REASONING_DIMENSIONS:
        dim_scores = np.array([r["dimensions"][dim]["score"] for r in results])
        if np.std(dim_scores) > 0:
            r_dim, p_dim = stats.pointbiserialr(correct_vals, dim_scores)
            sig = "*" if p_dim < 0.05 else " "
            print(f"    {dim:<22} r={r_dim:+.3f}  p={p_dim:.3f} {sig}")
        else:
            print(f"    {dim:<22} (no variance)")
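
The three pairwise correlations printed above are enough to compute a first-order partial correlation between RDS and correctness with length held fixed. The sketch below uses the standard formula in plain Python. The function names `pearson` and `partial_corr` and the toy arrays are assumptions for illustration, and the helper does not guard against constant inputs.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return float("nan") if denom == 0 else (r_xy - r_xz * r_yz) / denom


# Toy data standing in for per-problem RDS, correctness, and token counts.
rds = [0.02, 0.05, 0.11, 0.08, 0.14, 0.07]
correct = [0, 0, 1, 1, 1, 0]
tokens = [120, 200, 90, 150, 110, 260]

r_partial = partial_corr(rds, correct, tokens)
print(f"RDS vs correctness, controlling for length: r={r_partial:.3f}")
```

If the partial correlation stays close to the raw RDS-correctness correlation, length is not driving the relationship. If it shrinks toward zero, most of the apparent depth signal is verbosity.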

## Step 7: Failure Mode Analysis — When Depth Doesn't Help

Let's find the most interesting failure cases:
1. **High depth, wrong answer** — The model reasoned hard but went off the rails.
2. **Low depth, right answer** — Pattern matching or memorization?
3. **Depth collapse** — Problems where the model gives up early.

These cases tell us what the metric captures and what it misses.

In [None]:
# Pick one model for deep analysis
analysis_model = list(all_results.keys())[0]
results = all_results[analysis_model]

# Sort by RDS
sorted_results = sorted(results, key=lambda r: r["rds"], reverse=True)

print(f"Analyzing: {analysis_model}")
print(f"\n{'=' * 70}")
print(f"HIGH DEPTH + WRONG ANSWER (Overthinking?)")
print(f"{'=' * 70}")

high_depth_wrong = [r for r in sorted_results if not r["correct"] and r["rds"] > 0]
for r in high_depth_wrong[:3]:
    print(f"\nQ: {r['question'][:100]}...")
    print(f"RDS: {r['rds']:.4f} | Truth: {r['ground_truth']} | Predicted: {r['predicted']}")
    print(f"Output: {r['completion'][:300]}...")
    print(f"Active dimensions: ", end="")
    for dim, info in r["dimensions"].items():
        if info["raw_count"] > 0:
            print(f"{dim}({info['raw_count']})", end=" ")
    print()

print(f"\n{'=' * 70}")
print(f"LOW DEPTH + CORRECT ANSWER (Lucky guess or memorization?)")
print(f"{'=' * 70}")

low_depth_correct = [r for r in sorted_results if r["correct"]]
low_depth_correct.sort(key=lambda r: r["rds"])
for r in low_depth_correct[:3]:
    print(f"\nQ: {r['question'][:100]}...")
    print(f"RDS: {r['rds']:.4f} | Truth: {r['ground_truth']} | Tokens: {r['token_count']}")
    print(f"Output: {r['completion'][:300]}...")

print(f"\n{'=' * 70}")
print(f"REASONING COLLAPSE (Model gives up — RDS ≈ 0)")
print(f"{'=' * 70}")

collapsed = [r for r in sorted_results if r["rds"] < 0.001]
print(f"\n{len(collapsed)} / {len(results)} completions have near-zero RDS.")
if collapsed:
    collapse_acc = sum(1 for r in collapsed if r["correct"]) / len(collapsed)
    print(f"Accuracy in this group: {collapse_acc:.1%}")
    for r in collapsed[:2]:
        print(f"\nQ: {r['question'][:100]}...")
        print(f"Output: {r['completion'][:200]}...")
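
The failure modes above amount to a two-by-two split: deep or shallow, correct or wrong. As a compact way to count how many completions fall in each cell, here is a sketch that splits at the median RDS. The function name `classify` and the toy records are hypothetical.

```python
import statistics
from collections import Counter


def classify(results):
    """Label each result by correctness and by RDS relative to the median."""
    median_rds = statistics.median(r["rds"] for r in results)
    labels = []
    for r in results:
        depth = "deep" if r["rds"] > median_rds else "shallow"
        outcome = "correct" if r["correct"] else "wrong"
        labels.append(f"{depth}+{outcome}")
    return labels


# Toy records with the same keys the notebook stores per problem.
toy_results = [
    {"rds": 0.20, "correct": False},
    {"rds": 0.00, "correct": True},
    {"rds": 0.15, "correct": True},
    {"rds": 0.01, "correct": False},
]
labels = classify(toy_results)
print(Counter(labels))
```

A large "deep+wrong" cell would suggest that the markers are present without the underlying reasoning being sound, which is exactly what a surface-level metric cannot see.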

In [None]:
# ============================================================
# Figure 5: Token Count vs RDS (controlling for verbosity)
# ============================================================

fig, axes = plt.subplots(1, len(all_results), figsize=(6 * len(all_results), 5))
if len(all_results) == 1:
    axes = [axes]

for idx, (model_name, results) in enumerate(all_results.items()):
    ax = axes[idx]

    for r in results:
        color = "#2ecc71" if r["correct"] else "#e74c3c"
        marker = "o" if r["correct"] else "x"
        ax.scatter(
            r["token_count"], r["rds"],
            color=color, marker=marker, alpha=0.5, s=30,
        )

    ax.set_xlabel("Token Count")
    ax.set_ylabel("Reasoning Depth Score")
    ax.set_title(f"{model_name}", fontsize=13)

    # Add legend
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker="o", color="w", markerfacecolor="#2ecc71", markersize=8, label="Correct"),
        Line2D([0], [0], marker="x", color="#e74c3c", markersize=8, label="Wrong", linestyle="None"),
    ]
    ax.legend(handles=legend_elements, loc="upper right")

plt.suptitle("Token Count vs Reasoning Depth (Green=Correct, Red=Wrong)", fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig("tokens_vs_rds.png", dpi=150, bbox_inches="tight")
plt.show()

## Step 8: RDS as a Reward Signal — Can We Use This for RL?

One practical application: use RDS as a **process reward** in RL training. Instead of only rewarding correct final answers (outcome reward), we could give partial credit for reasoning quality.

Let's simulate what this would look like by plotting the distribution of a **combined reward** (correctness + depth), grouped by problem difficulty. A good reward function should:
1. Assign the highest reward to correct + deep solutions.
2. Give partial credit to deep but wrong solutions (they're trying!).
3. Give near-zero reward to shallow wrong solutions (guessing).

In [None]:
def combined_reward(correct: bool, rds: float, alpha: float = 0.3) -> float:
    """Combined outcome + process reward.

    r = correctness + alpha * RDS

    Alpha controls the depth bonus weight.
    """
    return (1.0 if correct else 0.0) + alpha * rds


print(f"{'=' * 70}")
print(f"COMBINED REWARD DISTRIBUTION (correctness + 0.3 × RDS)")
print(f"{'=' * 70}")

fig, axes = plt.subplots(1, len(all_results), figsize=(6 * len(all_results), 5))
if len(all_results) == 1:
    axes = [axes]

for idx, (model_name, results) in enumerate(all_results.items()):
    ax = axes[idx]

    rewards = [combined_reward(r["correct"], r["rds"]) for r in results]

    # Group by difficulty
    for diff, color in zip(["easy", "medium", "hard"], ["#2ecc71", "#f39c12", "#e74c3c"]):
        diff_rewards = [rw for rw, r in zip(rewards, results) if r["difficulty"] == diff]
        if diff_rewards:
            ax.hist(diff_rewards, bins=15, alpha=0.5, color=color, label=diff)

    ax.set_title(f"{model_name}", fontsize=13)
    ax.set_xlabel("Combined Reward")
    ax.set_ylabel("Count")
    ax.legend()

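    # Stats: "deep" means RDS above this model's median.
    median_rds = np.median([r["rds"] for r in results])
    deep_correct = [rw for rw, r in zip(rewards, results) if r["correct"] and r["rds"] > median_rds]
    print(f"\n{model_name}:")
    print(f"  Mean combined reward: {np.mean(rewards):.3f}")
    if deep_correct:
        print(f"  Reward for correct+deep: {np.mean(deep_correct):.3f}")
    else:
        # Avoid np.mean([]) -> nan plus a RuntimeWarning when no sample qualifies.
        print("  Reward for correct+deep: n/a (no correct solutions above median RDS)")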

plt.suptitle("Combined Reward by Problem Difficulty", fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig("combined_reward.png", dpi=150, bbox_inches="tight")
plt.show()
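
The three criteria listed at the start of this step can be spot-checked on hand-picked inputs. The sketch below redefines `combined_reward` so that it runs on its own. The two RDS values (`SHALLOW`, `DEEP`) are invented for illustration and are not measurements from the models above.

```python
def combined_reward(correct: bool, rds: float, alpha: float = 0.3) -> float:
    """Same formula as the cell above: correctness + alpha * RDS."""
    return (1.0 if correct else 0.0) + alpha * rds


# Hypothetical RDS values, chosen only to illustrate the ordering.
SHALLOW, DEEP = 0.02, 0.8

cases = {
    "correct + deep": combined_reward(True, DEEP),
    "correct + shallow": combined_reward(True, SHALLOW),
    "wrong + deep": combined_reward(False, DEEP),
    "wrong + shallow": combined_reward(False, SHALLOW),
}
for name, reward in cases.items():
    print(f"{name:<18} {reward:.3f}")

# Criterion 1: correct + deep ranks highest.
assert cases["correct + deep"] == max(cases.values())
# Criterion 2: deep-but-wrong earns more than shallow-but-wrong.
assert cases["wrong + deep"] > cases["wrong + shallow"]
# Criterion 3: shallow-and-wrong is close to zero, but not exactly zero.
assert 0.0 < cases["wrong + shallow"] < 0.01
```

With this formula, a wrong answer can outrank a correct one only when its depth bonus `alpha * rds` exceeds the correct answer's bonus by more than 1. If RDS turns out to be unbounded in practice, clipping the depth bonus would be one way to keep correctness dominant.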

## What We Learned

### The Experiment in One Sentence
We built a decomposable, interpretable metric for measuring how deeply a language model reasons, and tested it across Qwen3 (0.6B, 1.7B) and Gemma 3 (1B) to show that reasoning depth is not just verbosity — it captures distinct cognitive behaviors that correlate with accuracy.

### Key Findings

**1. Reasoning depth is not just length.** After normalizing by token count, RDS still shows meaningful variation between models and problems. A 50-token solution can be deeper than a 200-token one if it contains more logical chaining, verification, and decomposition per token.

**2. Different models have different reasoning signatures.** The radar chart reveals that Qwen3 and Gemma 3 models emphasize different dimensions: some models are heavy on decomposition but light on verification, while others show more self-correction. This plausibly reflects differences in pretraining data and alignment recipe, though we did not test that directly.

**3. Self-correction is the rarest and most predictive signal.** The `self_correction` dimension has the lowest counts across all models, but when it appears, it correlates most strongly with correct answers. This suggests that eliciting genuine self-correction could be high-value, although the correlation alone does not show that training a model to say "wait, let me reconsider" would improve accuracy.

**4. Reasoning collapse is a real failure mode.** A non-trivial fraction of completions have near-zero RDS — the model just outputs the answer without any reasoning structure. These low-depth outputs have significantly lower accuracy.

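**5. RDS is a plausible process reward.** The combined reward (correctness + depth) provides a denser training signal than binary correctness alone. It gives partial credit to solutions that show reasoning structure but arrive at the wrong number (e.g., arithmetic error in the last step). We only simulated the reward distribution here; whether it improves RL training is untested, and because RDS counts surface markers, it cannot tell whether the reasoning behind them is sound.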

**6. Qwen3's `/no_think` mode reveals raw capability.** By disabling the built-in thinking scaffolding, we see what the model can reason about on its own. This is the right baseline for evaluating RL training improvements.
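
Finding 1 can be made concrete with the RDS formula from the introduction. The sketch below is an illustration under stated assumptions: uniform weights ($w_d = 1$), invented marker counts, and a hypothetical helper name `reasoning_depth_score`. None of these come from the experiments above.

```python
import math
from typing import Dict, Optional


def reasoning_depth_score(
    counts: Dict[str, int],
    length: int,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """RDS(y) = sum_d w_d * (count_d / len(y)) * log(1 + count_d)."""
    if length <= 0:
        return 0.0
    weights = weights or {}  # assumption: any missing weight defaults to 1.0
    return sum(
        weights.get(dim, 1.0) * (count / length) * math.log1p(count)
        for dim, count in counts.items()
    )


# Hypothetical completions: marker counts per dimension and token length.
short_dense = reasoning_depth_score(
    {"logical_chaining": 3, "verification": 2, "decomposition": 2}, length=50
)
long_sparse = reasoning_depth_score(
    {"logical_chaining": 4, "verification": 1, "decomposition": 1}, length=200
)

print(f"50-token solution:  RDS = {short_dense:.4f}")
print(f"200-token solution: RDS = {long_sparse:.4f}")

# The shorter completion scores higher because its markers are denser per token.
assert short_dense > long_sparse
```

The longer completion has almost as many markers in total (6 versus 7), but they are spread over four times as many tokens, so its per-token density and its score are much lower.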

### Limitations

- **Pattern-based detection is noisy.** "If" can be a conditional marker or just normal English. Better NLP parsing would help.
- **Language-dependent.** The patterns are English-specific. A multilingual version would need an equivalent marker list for each language.
- **No causality.** We show correlation between depth and accuracy, not causation. The model might reason deeply *because* it knows the answer, not *in order to* find it.
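
The first limitation is easy to reproduce. The sketch below is illustrative only: the patterns are assumptions, not the ones used by the detector in this notebook. It compares a plain substring count with a word-boundary regex on a made-up sentence.

```python
import re

text = "The two numbers are different. If x is even, halve it; check if you like."

# Naive substring count: also fires inside the word "different".
naive_hits = text.lower().count("if")

# Word-boundary match: only the standalone token "if" counts.
word_hits = len(re.findall(r"\bif\b", text, flags=re.IGNORECASE))

print(f"substring hits: {naive_hits}, word-boundary hits: {word_hits}")
assert naive_hits == 3
assert word_hits == 2
```

Word boundaries remove the hit inside "different", but of the two remaining matches only "If x is even" opens a real case split; "check if you like" is ordinary English. Telling those apart is where better NLP parsing would be needed.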

### What's Next

- **Use RDS as a GRPO reward** — Replace binary correctness with combined_reward in the Verifier-RL notebook.
- **Pre/post RL comparison** — Does GRPO training change the reasoning depth profile?
- **Think vs no-think** — Compare RDS with Qwen3's thinking mode enabled vs disabled.
- **Qwen3.5 MoE analysis** — Does the MoE architecture (3B active / 35B total) produce qualitatively different depth profiles?
- **Steering integration** — Does our CAA/SAE steering (from earlier experiments) increase RDS?
- **Probe-based depth** — Use hidden state probes instead of text patterns for a model-internal measure.

In [None]:
# Final cleanup
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("Done.")