# Hallucination Guard: Representation Reading for Truthfulness Detection

This notebook demonstrates **representation reading**: training classifiers on a model's internal activations to detect hallucinations and untruthful responses.

**Approach:** Instead of modifying the model or using steering, we train classifiers that "read" the model's internal representations to detect when it's likely generating untruthful content. This is a non-invasive monitoring approach.

**Key Concept:** The model's hidden states carry information about whether the content it is generating is truthful. A classifier trained on these activations can flag likely hallucinations as the response is generated.

## CLI Commands Used:
- `generate-pairs-from-task`: Extract truthful vs untruthful pairs from TruthfulQA
- `tasks`: Train an activation-based classifier (representation reading)
- `generate-responses`: Generate responses for testing
- `evaluate-responses`: Evaluate generated responses

## 1. Setup and Configuration

In [None]:
import os
import json

# Configuration
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
TASK = "truthfulqa_gen"  # TruthfulQA generation task
OUTPUT_DIR = "./hallucination_guard_outputs"
LAYER = 8  # Layer for activation collection (middle layers often work best)

# Create output directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/pairs", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/classifiers", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/responses", exist_ok=True)

print(f"Model: {MODEL}")
print(f"Task: {TASK}")
print(f"Layer for classification: {LAYER}")
print(f"Output directory: {OUTPUT_DIR}")

## 2. Extract Contrastive Pairs from TruthfulQA

TruthfulQA provides questions with:
- **Truthful answers**: Factually correct, honest responses
- **Untruthful answers**: Common misconceptions, false beliefs, hallucinations

We use these to train our representation reader.
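
As a rough illustration of how such pairs become training data, the sketch below flattens each pair into two labelled rows. The field names (`prompt`, `positive_response`, `negative_response`, `model_response`) match the pairs file inspected later in this section; the helper name `pairs_to_labeled_examples`, the label convention (1 = untruthful) and the toy pair are assumptions made for this example, not part of the CLI.

```python
from typing import Dict, List, Tuple


def pairs_to_labeled_examples(pairs: List[Dict]) -> List[Tuple[str, int]]:
    """Flatten contrastive pairs into (text, label) rows.

    Label 0 = truthful (positive response), 1 = untruthful (negative response).
    This label convention is an assumption of the sketch.
    """
    examples = []
    for pair in pairs:
        prompt = pair["prompt"]
        truthful = pair["positive_response"]["model_response"]
        untruthful = pair["negative_response"]["model_response"]
        examples.append((f"{prompt} {truthful}", 0))
        examples.append((f"{prompt} {untruthful}", 1))
    return examples


# Hand-written placeholder pair, for illustration only (not taken from the dataset).
toy_pairs = [
    {
        "prompt": "Q: <question text>",
        "positive_response": {"model_response": "<truthful answer>"},
        "negative_response": {"model_response": "<untruthful answer>"},
    }
]
examples = pairs_to_labeled_examples(toy_pairs)
print(examples)
```

Each question therefore contributes one example per class, which keeps the training set balanced.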

In [None]:
# Extract contrastive pairs from TruthfulQA
!python -m wisent.core.main generate-pairs-from-task \
    truthfulqa_gen \
    --output {OUTPUT_DIR}/pairs/truthfulqa_pairs.json \
    --limit 150 \
    --verbose

In [None]:
# Examine the extracted pairs
with open(f"{OUTPUT_DIR}/pairs/truthfulqa_pairs.json", 'r') as f:
    pairs_data = json.load(f)

print(f"Extracted {pairs_data['num_pairs']} contrastive pairs from {pairs_data['task_name']}")
print("\n" + "="*70)

# Show examples
for i, pair in enumerate(pairs_data['pairs'][:3]):
    print(f"\nExample {i+1}:")
    print(f"Question: {pair['prompt'][:100]}...")
    print(f"TRUTHFUL: {pair['positive_response']['model_response'][:80]}...")
    print(f"UNTRUTHFUL: {pair['negative_response']['model_response'][:80]}...")
    print("-"*70)

## 3. Train Representation Reader (Classifier)

The `tasks` command trains a classifier on the model's internal activations. This classifier learns to distinguish truthful from untruthful responses by reading the model's hidden states.

**How it works:**
1. Feed truthful and untruthful responses through the model
2. Extract activations from a specific layer
3. Train a classifier (logistic regression or MLP) on these activations
4. The classifier learns to predict truthfulness from internal representations
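
The four steps can be sketched end to end with NumPy. This is not the `tasks` implementation: the activations are synthetic stand-ins (random noise with a class-dependent shift), the hidden size and token count are arbitrary, and the logistic regression is fitted with plain gradient descent so the example has no dependency beyond NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 16           # stand-in for the model's hidden size (assumption)
TOKENS_PER_RESPONSE = 10  # arbitrary response length for the toy data


def fake_activations(n_responses: int, shift: float) -> np.ndarray:
    """Synthetic stand-in for steps 1-2: shape (responses, tokens, hidden)."""
    noise = rng.normal(size=(n_responses, TOKENS_PER_RESPONSE, HIDDEN_DIM))
    return noise + shift


# Steps 1-2 (simulated): the two classes are shifted in opposite directions.
truthful = fake_activations(100, shift=+0.5)
untruthful = fake_activations(100, shift=-0.5)

# Average over tokens so each response becomes a single feature vector.
X = np.concatenate([truthful, untruthful]).mean(axis=1)
y = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = untruthful

# 80/20 train/test split on shuffled indices.
order = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = order[:split], order[split:]

# Step 3: logistic regression fitted by gradient descent on the log loss.
w = np.zeros(HIDDEN_DIM)
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X[train_idx] @ w + b)))
    error = p - y[train_idx]
    w -= 0.5 * (X[train_idx].T @ error) / len(train_idx)
    b -= 0.5 * error.mean()

# Step 4: predict hallucination probability for held-out responses.
test_proba = 1.0 / (1.0 + np.exp(-(X[test_idx] @ w + b)))
test_accuracy = float(((test_proba >= 0.5) == y[test_idx]).mean())
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Real activations are far less cleanly separated than this toy data, so the accuracy printed here says nothing about what to expect on TruthfulQA.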

In [None]:
# Train a logistic regression classifier on the configured layer (LAYER)
!python -m wisent.core.main tasks \
    truthfulqa_gen \
    --model {MODEL} \
    --layer {LAYER} \
    --classifier-type logistic \
    --token-aggregation average \
    --detection-threshold 0.5 \
    --split-ratio 0.8 \
    --limit 100 \
    --save-classifier {OUTPUT_DIR}/classifiers/truthfulness_classifier_layer{LAYER}.pt \
    --output {OUTPUT_DIR}/classifiers \
    --verbose

In [None]:
# Check training results
import glob

# Find the training report
report_files = glob.glob(f"{OUTPUT_DIR}/classifiers/*report*.json")
if report_files:
    with open(report_files[0], 'r') as f:
        report = json.load(f)
    
    print("="*60)
    print("CLASSIFIER TRAINING REPORT")
    print("="*60)
    print(f"  Task: {report.get('task', 'N/A')}")
    print(f"  Layer: {report.get('layer', 'N/A')}")
    print(f"  Classifier type: {report.get('classifier_type', 'N/A')}")
    
    if 'metrics' in report:
        metrics = report['metrics']
        print(f"\nPerformance Metrics:")
        print(f"  Accuracy: {metrics.get('accuracy', 0):.4f}")
        print(f"  F1 Score: {metrics.get('f1_score', 0):.4f}")
        print(f"  Precision: {metrics.get('precision', 0):.4f}")
        print(f"  Recall: {metrics.get('recall', 0):.4f}")
    print("="*60)
else:
    print("Training report not found. Check classifier output.")
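
For reference, the four metrics printed above relate to the confusion matrix as in the sketch below. It assumes label 1 (untruthful) is the positive class, which may not match the convention used in the report; the helper name `binary_metrics` and the toy labels are illustrative only.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Toy labels for illustration only (not classifier output).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
toy_metrics = binary_metrics(y_true, y_pred)
print(toy_metrics)
```

For a hallucination guard, recall measures how many untruthful responses are caught and precision measures how many flags are justified, so the two are worth reading separately rather than relying on accuracy alone.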

## 4. Train Classifiers on Multiple Layers

Different layers may capture different aspects of truthfulness. Let's train classifiers on multiple layers and compare.

In [None]:
# Train classifiers on layers 4, 8, 12 (early, middle, late)
LAYERS_TO_TEST = [4, 8, 12]

for layer in LAYERS_TO_TEST:
    print(f"\n{'='*60}")
    print(f"Training classifier for Layer {layer}")
    print("="*60)
    
    !python -m wisent.core.main tasks \
        truthfulqa_gen \
        --model {MODEL} \
        --layer {layer} \
        --classifier-type logistic \
        --token-aggregation average \
        --detection-threshold 0.5 \
        --split-ratio 0.8 \
        --limit 100 \
        --save-classifier {OUTPUT_DIR}/classifiers/truthfulness_classifier_layer{layer}.pt \
        --output {OUTPUT_DIR}/classifiers/layer{layer}
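
To actually compare the layers, the per-layer reports need to be read back. The sketch below assumes each run writes a `*report*.json` file with a `metrics.accuracy` field into its `layer{N}` output directory, mirroring the report read in section 3; the file location and the helper names `load_layer_accuracies` and `best_layer` are assumptions, not documented CLI behaviour.

```python
import glob
import json
import os
from typing import Dict, Iterable, Optional


def load_layer_accuracies(classifier_dir: str,
                          layers: Iterable[int]) -> Dict[int, Optional[float]]:
    """Read the accuracy from each layer's report, or None if no report exists."""
    accuracies: Dict[int, Optional[float]] = {}
    for layer in layers:
        pattern = os.path.join(classifier_dir, f"layer{layer}", "*report*.json")
        matches = sorted(glob.glob(pattern))
        if not matches:
            accuracies[layer] = None
            continue
        with open(matches[0], "r") as f:
            report = json.load(f)
        accuracies[layer] = report.get("metrics", {}).get("accuracy")
    return accuracies


def best_layer(accuracies: Dict[int, Optional[float]]) -> Optional[int]:
    """Return the layer with the highest accuracy, ignoring missing reports."""
    scored = {layer: acc for layer, acc in accuracies.items() if acc is not None}
    return max(scored, key=scored.get) if scored else None
```

In this notebook the helpers could be called as `best_layer(load_layer_accuracies(f"{OUTPUT_DIR}/classifiers", LAYERS_TO_TEST))`. With only 100 examples per run, small accuracy differences between layers are likely to be noise.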

## 5. Train MLP Classifier (More Expressive)

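An MLP can fit non-linear decision boundaries that logistic regression cannot, so it may perform better. The trade-off is more parameters and a higher risk of overfitting on a training set this small (`--limit 100`).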
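
What the extra expressiveness buys can be seen on a toy problem unrelated to real activations. XOR is the classic case where no single linear boundary separates the classes, yet a tiny MLP does. The sketch below checks a grid of linear classifiers and compares them with a 2-2-1 ReLU network whose weights are set by hand; nothing here reflects how the `mlp` classifier type is implemented.

```python
from itertools import product

# The four XOR points with their labels.
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def linear_predict(x, w1, w2, b):
    """Single linear boundary, as used by logistic regression at threshold 0.5."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def relu(v):
    return max(0.0, v)


def mlp_predict(x):
    """2-2-1 ReLU network with hand-set weights that computes XOR."""
    h1 = relu(x[0] + x[1])
    h2 = relu(x[0] + x[1] - 1.0)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0


# Try every linear classifier on a coarse grid: -3.0 to 3.0 in steps of 0.5.
grid = [i / 2 for i in range(-6, 7)]
best_linear = max(
    sum(linear_predict(x, w1, w2, b) == y for x, y in XOR_DATA)
    for w1, w2, b in product(grid, repeat=3)
)
mlp_correct = sum(mlp_predict(x) == y for x, y in XOR_DATA)

print(f"best linear classifier: {best_linear}/4 correct")
print(f"hand-set MLP:           {mlp_correct}/4 correct")
```

Whether truthfulness in the activations is linearly readable is an empirical question, which is why comparing both classifier types on the same layer is worthwhile.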

In [None]:
# Train MLP classifier
!python -m wisent.core.main tasks \
    truthfulqa_gen \
    --model {MODEL} \
    --layer {LAYER} \
    --classifier-type mlp \
    --token-aggregation average \
    --detection-threshold 0.5 \
    --split-ratio 0.8 \
    --limit 100 \
    --save-classifier {OUTPUT_DIR}/classifiers/truthfulness_classifier_mlp.pt \
    --output {OUTPUT_DIR}/classifiers/mlp \
    --verbose

## 6. Generate Test Responses

Generate responses from the model to test our hallucination detector.

In [None]:
# Generate responses on TruthfulQA test questions
!python -m wisent.core.main generate-responses \
    {MODEL} \
    --task truthfulqa_gen \
    --output {OUTPUT_DIR}/responses/generated_responses.json \
    --num-questions 20 \
    --max-new-tokens 100 \
    --temperature 0.7 \
    --verbose

In [None]:
# View generated responses
with open(f"{OUTPUT_DIR}/responses/generated_responses.json", 'r') as f:
    responses = json.load(f)

print(f"Generated {len(responses['responses'])} responses")
print("\n" + "="*70)

for i, resp in enumerate(responses['responses'][:3]):
    print(f"\nQuestion {i+1}:")
    print(f"Prompt: {resp['prompt'][:80]}...")
    print(f"Generated: {resp['generated_response'][:100]}...")
    print(f"Reference (truthful): {resp['positive_reference'][:80]}...")
    print("-"*70)

## 7. Evaluate Responses

Use TruthfulQA's evaluator to assess the truthfulness of generated responses.

In [None]:
# Evaluate generated responses
!python -m wisent.core.main evaluate-responses \
    --input {OUTPUT_DIR}/responses/generated_responses.json \
    --output {OUTPUT_DIR}/responses/evaluation_results.json \
    --task truthfulqa_gen \
    --verbose

In [None]:
# Display evaluation results
try:
    with open(f"{OUTPUT_DIR}/responses/evaluation_results.json", 'r') as f:
        eval_results = json.load(f)
    
    print("="*60)
    print("EVALUATION RESULTS")
    print("="*60)
    print(f"  Task: {eval_results.get('task', 'N/A')}")
    print(f"  Evaluator: {eval_results.get('evaluator_used', 'N/A')}")
    print(f"  Total evaluated: {eval_results.get('num_evaluated', 0)}")
    
    metrics = eval_results.get('aggregated_metrics', {})
    print(f"\nAggregated Metrics:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.4f}")
        else:
            print(f"  {metric}: {value}")
    print("="*60)
except FileNotFoundError:
    print("Evaluation results not found.")

## 8. Visualize Token-Level Hallucination Scores

The classifier outputs a hallucination probability for **each token** in the generated response. We can visualize this as a heatmap where:
- **Green** = Likely truthful (low hallucination score)
- **Yellow** = Uncertain
- **Red** = Likely hallucinating (high hallucination score)

This helps identify exactly **where** in a response the model starts hallucinating.
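
One simple way to turn "where the model starts hallucinating" into a number is to look for the first sustained run of high scores. The helper below is a sketch under assumed defaults: the name `first_flagged_token`, the 0.5 threshold and the two-token minimum run are all illustrative choices, and the toy scores are made up.

```python
from typing import Optional, Sequence


def first_flagged_token(scores: Sequence[float],
                        threshold: float = 0.5,
                        min_run: int = 2) -> Optional[int]:
    """Index where the first run of at least min_run scores above threshold starts.

    Returns None if no such run exists.
    """
    run_start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            # The run was interrupted, so start over.
            run_start = None
    return None


# Made-up scores: an isolated spike at index 2, then a sustained rise from index 4.
toy_scores = [0.1, 0.2, 0.7, 0.3, 0.6, 0.8, 0.9]
onset = first_flagged_token(toy_scores)
print(onset)
```

Requiring a run of several tokens ignores isolated spikes, which are common in per-token classifier output.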

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from transformers import AutoTokenizer
from IPython.display import Image as IPImage

# Load tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Path to generation details (created by the tasks command)
generation_details_path = f"{OUTPUT_DIR}/classifiers/generation_details.json"

print(f"Tokenizer loaded: {MODEL}")
print(f"Generation details path: {generation_details_path}")

In [None]:
def save_hallucination_heatmap_png(text, scores, output_path, tokenizer, 
                                    title="Hallucination Heatmap", words_per_row=12):
    """Save word-level hallucination scores as a clean PNG heatmap.
    
    Args:
        text: The generated response text
        scores: List of per-token hallucination scores (0-1)
        output_path: Path to save the PNG file
        tokenizer: The tokenizer used for the model
        title: Title for the plot
        words_per_row: Number of words per row in the visualization
    """
    # Split into words
    words = text.split()
    scores = list(scores)

    # Map token scores to words by averaging the scores of each word's tokens.
    # Words are tokenized one at a time, so the alignment with the tokenization
    # of the full text is approximate.
    word_scores = []
    token_idx = 0

    for word in words:
        num_tokens = len(tokenizer.tokenize(word))
        word_token_scores = scores[token_idx:token_idx + num_tokens]

        if word_token_scores:
            # Average whatever scores are available for this word
            word_score = float(np.mean(word_token_scores))
        else:
            # No scores left for this word: fall back to "uncertain"
            word_score = 0.5

        word_scores.append(word_score)
        token_idx += num_tokens
    
    # Create visualization
    num_words = len(words)
    num_rows = (num_words + words_per_row - 1) // words_per_row
    
    fig_height = max(4, num_rows * 1.0 + 2)
    fig, ax = plt.subplots(figsize=(16, fig_height))
    
    cmap = plt.cm.RdYlGn_r  # Red = hallucinating, Green = truthful
    
    y_pos = num_rows - 1
    x_pos = 0
    
    for word, score in zip(words, word_scores):
        color = cmap(score)
        
        rect = plt.Rectangle((x_pos, y_pos), 1, 0.7, 
                             facecolor=color, edgecolor='white', linewidth=1)
        ax.add_patch(rect)
        
        display_word = word[:12]  # Truncate long words so they fit in the cell
        text_color = 'black' if score < 0.65 else 'white'
        ax.text(x_pos + 0.5, y_pos + 0.35, display_word,
               ha='center', va='center', fontsize=8,
               color=text_color, fontweight='bold')
        
        x_pos += 1
        if x_pos >= words_per_row:
            x_pos = 0
            y_pos -= 1
    
    ax.set_xlim(0, words_per_row)
    ax.set_ylim(-0.5, num_rows)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title, fontsize=14, fontweight='bold', pad=15)
    
    # Add colorbar legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(0, 1))
    sm.set_array([])
    cbar = plt.colorbar(sm, ax=ax, orientation='horizontal', 
                        fraction=0.04, pad=0.08, aspect=50)
    cbar.set_label('Hallucination Probability', fontsize=11)
    cbar.set_ticks([0, 0.5, 1.0])
    cbar.set_ticklabels(['Truthful', 'Uncertain', 'Hallucinating'])
    
    plt.tight_layout()
    
    os.makedirs(os.path.dirname(os.path.abspath(output_path)) or '.', exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches='tight', 
                facecolor='white', edgecolor='none')
    plt.close()
    
    print(f"Saved heatmap to: {output_path}")
    return output_path

print("PNG export function loaded!")

## 9. Export Visualizations as PNG

Save the token-level hallucination heatmaps as PNG images for reports or presentations.

In [None]:
# Export visualizations as PNG images
os.makedirs(f"{OUTPUT_DIR}/visualizations", exist_ok=True)

if os.path.exists(generation_details_path):
    with open(generation_details_path, 'r') as f:
        data = json.load(f)
    
    generations = data.get('generations', [])
    
    print("Exporting hallucination heatmaps as PNG...")
    print("="*70)
    
    # Export first 2 examples
    for i, gen in enumerate(generations[:2]):
        response = gen.get('response', gen.get('original_response', ''))
        token_scores = gen.get('token_scores', [])
        eval_result = gen.get('eval_result', 'N/A')
        classifier_proba = gen.get('classifier_proba', 0)
        
        if not token_scores:
            continue
        
        title = f"Example {i+1}: {eval_result} ({classifier_proba:.1%} hallucination prob.)"
        output_path = f"{OUTPUT_DIR}/visualizations/hallucination_heatmap_{i+1}.png"
        save_hallucination_heatmap_png(response, token_scores, output_path, tokenizer, title=title)
    
    print(f"\nPNG files saved to: {OUTPUT_DIR}/visualizations/")
else:
    print("Run classifier training first to generate token scores.")

In [None]:
# Display one of the saved PNG images
from IPython.display import Image as IPImage, display

png_path = f"{OUTPUT_DIR}/visualizations/hallucination_heatmap_1.png"
if os.path.exists(png_path):
    print("Example PNG visualization:")
    display(IPImage(filename=png_path))
else:
    print("PNG not found. Run the export cell above first.")

## 10. Summary: CLI Commands Reference

### Extract Contrastive Pairs
```bash
python -m wisent.core.main generate-pairs-from-task \
    truthfulqa_gen \
    --output pairs.json \
    --limit 100
```

### Train Representation Reader (Classifier)
```bash
python -m wisent.core.main tasks \
    truthfulqa_gen \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --layer 8 \
    --classifier-type logistic \
    --token-aggregation average \
    --save-classifier classifier.pt \
    --output ./classifiers
```

### Generate Responses for Testing
```bash
python -m wisent.core.main generate-responses \
    meta-llama/Llama-3.2-1B-Instruct \
    --task truthfulqa_gen \
    --output responses.json \
    --num-questions 20
```

### Evaluate Responses
```bash
python -m wisent.core.main evaluate-responses \
    --input responses.json \
    --output evaluation.json \
    --task truthfulqa_gen
```

### Key Parameters:
- **`--layer`**: Which transformer layer to read activations from (middle layers often work best)
- **`--classifier-type`**: `logistic` (simpler, faster) or `mlp` (more expressive)
- **`--token-aggregation`**: How to combine token activations (`average`, `last`, `first`)
- **`--detection-threshold`**: Probability cutoff used for classification (0.5 weights both classes equally; adjust it to trade precision against recall)
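
The aggregation modes and the threshold can be illustrated in a few lines of plain Python. This is a sketch of the general idea, not the CLI's code: the helper names are hypothetical, and whether the real comparison is `>=` or `>` is an assumption.

```python
from typing import List, Sequence


def aggregate_tokens(token_vectors: Sequence[Sequence[float]],
                     method: str = "average") -> List[float]:
    """Collapse per-token activation vectors into one vector per response."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    if method == "first":
        return list(token_vectors[0])
    if method == "last":
        return list(token_vectors[-1])
    if method == "average":
        n = len(token_vectors)
        # zip(*...) iterates over hidden dimensions instead of tokens.
        return [sum(column) / n for column in zip(*token_vectors)]
    raise ValueError(f"unknown aggregation method: {method}")


def is_flagged(probability: float, threshold: float = 0.5) -> bool:
    """Flag a response when its score reaches the threshold (>= is assumed)."""
    return probability >= threshold


# Three tokens with a two-dimensional "hidden state" each (toy values).
toy_activations = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(aggregate_tokens(toy_activations, "average"))  # [3.0, 2.0]
print(aggregate_tokens(toy_activations, "first"))    # [1.0, 0.0]
print(aggregate_tokens(toy_activations, "last"))     # [5.0, 4.0]
print(is_flagged(0.62), is_flagged(0.62, threshold=0.8))  # True False
```

Raising the threshold produces fewer flags, trading recall for precision.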

### Output Files:
- **`training_report.json`**: Classifier metrics (accuracy, F1, precision, recall, AUC)
- **`generation_details.json`**: Per-response and per-token hallucination scores
- **`classifier.pt`**: Saved classifier model for reuse

### Token-Level Scores:
Each entry in `generation_details.json` reports scores at two granularities:
- **`classifier_proba`**: Overall response-level hallucination probability (0-1)
- **`token_scores`**: Array of per-token hallucination probabilities
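
A quick way to compare the two granularities is to summarize the token scores next to the response-level probability. The sketch below uses the `classifier_proba` and `token_scores` keys described above; the helper name, the summary statistics and the toy record are assumptions, and how the CLI derives `classifier_proba` from the tokens is not specified here.

```python
from typing import Dict, Optional


def summarize_generation(gen: Dict, threshold: float = 0.5) -> Optional[Dict]:
    """Summarize one generation record; returns None if it has no token scores."""
    scores = gen.get("token_scores", [])
    if not scores:
        return None
    return {
        "classifier_proba": gen.get("classifier_proba"),
        "mean_token_score": sum(scores) / len(scores),
        "max_token_score": max(scores),
        # Share of tokens scored strictly above the threshold.
        "fraction_flagged": sum(s > threshold for s in scores) / len(scores),
    }


# Toy record with made-up values, for illustration only.
toy_gen = {"classifier_proba": 0.5, "token_scores": [0.25, 0.75, 0.5, 0.5]}
summary = summarize_generation(toy_gen)
print(summary)
```

A response with a moderate overall probability but a high maximum token score is a good candidate for the heatmap view.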

### Key Concepts:
- **Representation Reading**: Using internal activations to detect model behavior
- **Non-invasive**: Does not modify the model, only monitors it
- **Real-time**: Can be used during inference to flag potential hallucinations
- **Interpretable**: Token-level scores indicate where in a response the hallucination score starts to rise
- **Visualizable**: Heatmaps make it easy to identify problematic regions