# Δa₂ Alignment & Introspection Demo (AnthropicAI-Ready)

This notebook demonstrates both **alignment mode** (synthetic Δa₂ correlation) and **introspection mode** (concept-injection experiment) from the `delta-a2-alignment-toolkit`.

> Designed for exploratory research in model interpretability and representational drift.

In [None]:
%pip install torch transformers pandas numpy scipy matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
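# Toolkit modules: these imports assume the notebook runs from the repository root
# (or that the root is on sys.path).
from src.core import generate_activations, compute_a2
from src.introspection import run_introspection_experiment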

## 1️⃣ Alignment Mode — synthetic Δa₂ correlation

In [None]:
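N_SAMPLES = 50
np.random.seed(0)  # fixed seed so the sampled strengths are reproducible
deltas, human_scores, strengths = [], [], []

for _ in range(N_SAMPLES):
    # Sample a perturbation strength and measure the a₂ shift it induces.
    strength = np.random.uniform(0, 4)
    strengths.append(strength)
    h_pre = generate_activations(0.0)
    h_post = generate_activations(strength)
    a2_pre = compute_a2(h_pre)
    a2_post = compute_a2(h_post)
    delta_a2 = a2_post - a2_pre
    # Synthetic stand-in for a human rating: falls linearly with |Δa₂|, clipped to [0, 1].
    score = max(0.0, min(1.0, 0.8 - 0.1 * abs(delta_a2)))
    deltas.append(delta_a2)
    human_scores.append(score)

# The score is computed from Δa₂ itself, so r checks the pipeline rather than a real preference signal.
r, p_value = pearsonr(deltas, human_scores)
print(f"Pearson r: {r:.4f} (p-value: {p_value:.6f})")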

df = pd.DataFrame({
    'Trial': range(1, N_SAMPLES + 1),
    'Strength': strengths,
    'Delta_a2': deltas,
    'Human_Score': human_scores
})
df.head()
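
Because the synthetic score depends on `abs(delta_a2)`, the sign of the drift matters when reading the Pearson coefficient. The sketch below reproduces only the scoring rule from the cell above and applies it to hypothetical drift values of mixed sign; the uniform range of -3 to 3 and the sample count are assumptions for illustration, not properties of `compute_a2`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def synthetic_score(delta_a2):
    """Scoring rule from the cell above: 0.8 - 0.1 * |Δa₂|, clipped to [0, 1]."""
    return np.clip(0.8 - 0.1 * np.abs(delta_a2), 0.0, 1.0)


# Hypothetical drift values (assumption); the notebook's real values come from compute_a2.
deltas = rng.uniform(-3.0, 3.0, size=200)
scores = synthetic_score(deltas)

# Pearson r against the signed drift and against its magnitude.
r_signed = np.corrcoef(deltas, scores)[0, 1]
r_abs = np.corrcoef(np.abs(deltas), scores)[0, 1]

print(f"r(Δa₂, score)   = {r_signed:+.3f}")
print(f"r(|Δa₂|, score) = {r_abs:+.3f}")
```

When no score is clipped, the correlation with `|Δa₂|` is exactly linear and negative, while the correlation with the signed drift can be weak if positive and negative drifts are both present. If the toolkit's Δa₂ values are all of one sign, the two coefficients coincide up to sign.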

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(df['Delta_a2'], df['Human_Score'], alpha=0.6, color='blue', label='Trials')
trend = np.poly1d(np.polyfit(df['Delta_a2'], df['Human_Score'], 1))
x_unique = np.unique(df['Delta_a2'])
plt.plot(x_unique, trend(x_unique), 'r--', label=f'Trend (r={r:.2f})')
plt.xlabel('Δa₂ Drift')
plt.ylabel('Human Alignment Score')
plt.title('Synthetic Δa₂ vs Alignment Preference')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 2️⃣ Introspection Mode — Anthropic-style concept injection

In [None]:
df_introspection = run_introspection_experiment(
    model_name="meta-llama/Llama-2-7b-hf",  # gated model: needs approved access and a Hugging Face login
    concept="all_caps",
    strength=2.0,
    num_trials=5  # keep small for Colab
)

df_introspection.head()

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(df_introspection['Δa₂'], df_introspection['Self_Report_Detection'], alpha=0.7, label='Trials')
trend = np.poly1d(np.polyfit(df_introspection['Δa₂'], df_introspection['Self_Report_Detection'], 1))
x_unique = np.unique(df_introspection['Δa₂'])
plt.plot(x_unique, trend(x_unique), 'r--', label='Trend')
plt.xlabel('Δa₂ Drift')
plt.ylabel('Self-Report Detection (0/1)')
plt.title('Δa₂ vs Self-Detection Correlation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
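
With only a handful of trials and a 0/1 outcome, the linear fit above can be undefined: if every trial reports the same detection value, or every Δa₂ is identical, there is no trend to estimate. The helper below is a hypothetical guard (`binary_trend` is not part of the toolkit) that returns the Pearson coefficient and the fitted line only when both columns vary. For a binary outcome, this Pearson coefficient is the point-biserial correlation.

```python
import numpy as np


def binary_trend(x, y, min_trials=3):
    """Return (r, slope, intercept) for a 0/1 outcome y against x.

    Returns None when there are too few trials or when either
    variable is constant, since the correlation is undefined then.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size < min_trials or np.unique(x).size < 2 or np.unique(y).size < 2:
        return None
    r = np.corrcoef(x, y)[0, 1]              # point-biserial r for binary y
    slope, intercept = np.polyfit(x, y, 1)   # same degree-1 fit as the plot
    return r, slope, intercept


# Illustrative values only (not toolkit output).
print(binary_trend([0.1, 0.2, 0.3, 0.4, 0.5], [0, 0, 1, 1, 1]))
print(binary_trend([0.1, 0.2, 0.3, 0.4, 0.5], [1, 1, 1, 1, 1]))  # None
```

In the notebook this could be called as `binary_trend(df_introspection['Δa₂'], df_introspection['Self_Report_Detection'])` before plotting, skipping the trend line when it returns `None`.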

✅ *Notebook complete.*

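Expected artifacts (the cells above only display results, so these must be written by the toolkit or by explicit `to_csv` / `savefig` calls):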
- `introspection_results.csv`
- `figure_da2_correlation.png`
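
A minimal sketch of writing both files explicitly, in case `run_introspection_experiment` does not already do so. The helper name `save_artifacts`, the output directory argument, and the DPI are assumptions; the column names and file names are taken from this notebook.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def save_artifacts(df, out_dir=".", x_col="Δa₂", y_col="Self_Report_Detection"):
    """Write the results table and the scatter plot to out_dir.

    Hypothetical helper, not part of the toolkit. Returns the two paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / "introspection_results.csv"
    fig_path = out / "figure_da2_correlation.png"

    # Persist the per-trial table without the pandas index column.
    df.to_csv(csv_path, index=False)

    # Re-draw the scatter on a fresh figure and save it instead of showing it.
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(df[x_col], df[y_col], alpha=0.7, label="Trials")
    ax.set_xlabel("Δa₂ Drift")
    ax.set_ylabel("Self-Report Detection (0/1)")
    ax.legend()
    ax.grid(True, alpha=0.3)
    fig.savefig(fig_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return csv_path, fig_path
```

Calling `save_artifacts(df_introspection)` after the introspection cell would place both files in the current working directory.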

You can now commit these to your repo or share with alignment researchers.