# LLM Instrumentation Framework - Complete Demo with Reasoning Model

This notebook demonstrates the complete capabilities of the **LLM Instrumentation Framework** using a small reasoning model. We'll capture activations separately during:
1. **Prompt processing** (input phase)
2. **Reasoning generation** (output phase)

## What You'll Learn
- How to configure and initialize the instrumentation framework
- How to instrument a transformer model
- How to capture activations at different granularities
- How to analyze captured activation data
- How to work with different compression algorithms
- How to separate prompt vs. generation activations

## 1. Setup and Installation

First, let's install the required dependencies and import the necessary modules.

In [None]:
# Install required packages (uncomment if needed)
# !pip install transformers torch accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import json

# Import the instrumentation framework
from llm_instrumentation import (
    InstrumentationFramework,
    InstrumentationConfig,
    HookGranularity,
)

print("✓ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Load a Small Reasoning Model

We'll use **Qwen2.5-0.5B-Instruct**, a small instruction-tuned model that can follow step-by-step reasoning prompts, which makes it convenient for testing. At roughly 500M parameters, it runs reasonably quickly even on CPU.

In [None]:
# Model configuration
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

print(f"Loading model: {MODEL_NAME}")
print("This may take a minute...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("✓ Tokenizer loaded")

# Load model. device_map="auto" (requires accelerate) places the model on a GPU
# when one is available and falls back to CPU otherwise.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,  # float32 works on both CPU and GPU
    device_map="auto",
    trust_remote_code=True,
)
print("✓ Model loaded")

# Print model architecture overview
print(f"\nModel architecture:")
print(f"  - Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  - Transformer layers: {model.config.num_hidden_layers}")
print(f"  - Total modules: {len(list(model.named_modules()))}")
print(f"  - Device: {next(model.parameters()).device}")
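As a rough sanity check on the load step, the memory needed for the weights alone can be estimated from the parameter count and the dtype width. The helper below is a minimal sketch; `weight_memory_gib` is an illustrative name, and the 500M figure is the rounded number quoted above rather than an exact count.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# 500M is the rounded parameter count mentioned in the text, not an exact value.
approx_params = 500_000_000
print(f"float32 weights: {weight_memory_gib(approx_params, 4):.2f} GiB")
print(f"float16 weights: {weight_memory_gib(approx_params, 2):.2f} GiB")
```

In the notebook itself, the exact count printed under "Total parameters" can be passed in instead of the rounded figure.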

## 3. Configure the Instrumentation Framework

The framework supports multiple configurations:
- **Granularity**: What to capture (full tensors, attention only, MLP only, sampled slices)
- **Compression**: Algorithm to use (lz4, zstd, none)
- **Throughput**: Target streaming rate
- **Memory**: Maximum host RAM usage

We'll create multiple configurations to demonstrate different use cases.

In [None]:
# Configuration 1: Full tensor capture with LZ4 compression
config_full = InstrumentationConfig(
    granularity=HookGranularity.FULL_TENSOR,
    compression_algorithm="lz4",  # Fast compression
    target_throughput_gbps=2.0,
    max_memory_gb=8,
)

# Configuration 2: Attention-only capture (for interpretability)
config_attention = InstrumentationConfig(
    granularity=HookGranularity.ATTENTION_ONLY,
    compression_algorithm="zstd",  # Better compression ratio
    target_throughput_gbps=2.0,
    max_memory_gb=8,
)

# Configuration 3: MLP-only capture
config_mlp = InstrumentationConfig(
    granularity=HookGranularity.MLP_ONLY,
    compression_algorithm="none",  # No compression for speed
    target_throughput_gbps=2.0,
    max_memory_gb=8,
)

print("✓ Configurations created:")
print(f"  1. Full tensor capture with LZ4")
print(f"  2. Attention-only with Zstd")
print(f"  3. MLP-only with no compression")
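To get a feel for how much data a capture might produce before compression, it can help to estimate the size of one hidden-state tensor per layer. This is a back-of-the-envelope sketch, not the framework's own accounting: `hidden_state_bytes` is a hypothetical helper, the layer count and hidden size are illustrative placeholders, and full-tensor capture may well record more than one tensor per layer.

```python
def hidden_state_bytes(num_layers: int, hidden_size: int, num_tokens: int,
                       bytes_per_element: int = 4) -> int:
    """Bytes for one hidden-state tensor per layer at batch size 1, uncompressed."""
    return num_layers * hidden_size * num_tokens * bytes_per_element


# Illustrative values only; read the real ones from model.config
# (num_hidden_layers, hidden_size) for the model you loaded.
example = hidden_state_bytes(num_layers=24, hidden_size=896, num_tokens=20)
print(f"{example:,} bytes for a 20-token forward pass")
```

Comparing an estimate like this with `max_memory_gb` gives a rough idea of whether a capture is likely to fit in host RAM.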

## 4. Prepare Reasoning Prompts

We'll use prompts that require step-by-step reasoning to demonstrate how the model processes information differently during thinking vs. output.

In [None]:
# Define reasoning prompts
prompts = [
    """Solve this step by step:
If a train travels 120 km in 2 hours, what is its average speed?
Show your reasoning.""",
    
    """Think carefully:
What is 15% of 200? Explain each step.""",
    
    """Reason through this:
If today is Wednesday, what day will it be in 10 days?""",
]

print("Reasoning prompts prepared:")
for i, prompt in enumerate(prompts, 1):
    print(f"\n{i}. {prompt[:50]}...")

## 5. Demo 1: Capture Prompt Processing Activations

In this demo, we'll capture activations **only during the prompt processing phase**. We'll use the attention-only configuration to focus on how the model attends to different parts of the input.

In [None]:
# Initialize framework for prompt capture
framework_prompt = InstrumentationFramework(config_attention)

# Register the model for instrumentation
framework_prompt.instrument_model(model)
print("✓ Model instrumented for prompt capture")

# Select first prompt
test_prompt = prompts[0]
print(f"\nPrompt to process:\n{test_prompt}")

# Tokenize the prompt
inputs = tokenizer(test_prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

print(f"\nTokenized prompt:")
print(f"  - Input shape: {input_ids.shape}")
print(f"  - Number of tokens: {input_ids.shape[1]}")
print(f"  - Device: {input_ids.device}")

# Capture activations during prompt processing
output_path_prompt = "prompt_activations.stream"

print(f"\n🔍 Capturing activations during PROMPT PROCESSING...")
print(f"   Output file: {output_path_prompt}")
print(f"   Granularity: ATTENTION_ONLY")
print(f"   Compression: zstd\n")

with framework_prompt.capture_activations(output_path_prompt):
    # Only forward pass (no generation)
    with torch.no_grad():
        outputs_prompt = model(input_ids)

print("✓ Prompt activations captured successfully!")
print(f"  - Output shape: {outputs_prompt.logits.shape}")
print(f"  - File created: {os.path.exists(output_path_prompt)}")
print(f"  - File size: {os.path.getsize(output_path_prompt):,} bytes")
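For intuition about what activation capture involves, the sketch below uses plain PyTorch forward hooks on a toy network. It is not the framework's implementation, which is not shown in this notebook; `capture_outputs` and the toy model are hypothetical, and the only library call relied on is `nn.Module.register_forward_hook`.

```python
import torch
from torch import nn


def capture_outputs(model: nn.Module, module_types=(nn.Linear,)):
    """Register forward hooks on matching modules; return (records, handles)."""
    records = []
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, module_types):
            # Bind the current name via a default argument so each hook keeps its own.
            def hook(mod, inputs, output, name=name):
                records.append((name, tuple(output.shape)))
            handles.append(module.register_forward_hook(hook))
    return records, handles


# Toy stand-in for a transformer: two linear layers with an activation between.
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
records, handles = capture_outputs(toy)

with torch.no_grad():
    toy(torch.randn(2, 8))

# Remove the hooks once the capture is done so they do not fire on later passes.
for handle in handles:
    handle.remove()

for name, shape in records:
    print(f"module {name}: output shape {shape}")
```

Removing the handles matters when the same model is instrumented more than once, as in this notebook: hooks that are left registered keep firing on every later forward pass.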

## 6. Demo 2: Capture Generation (Reasoning) Activations

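Now we'll capture activations **during the generation phase**, when the model is producing output tokens one at a time. This demo uses full tensor capture to record complete activations. Note that `model.generate` first runs a forward pass over the prompt before emitting new tokens, so this stream likely also contains the prompt-processing activations.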

In [None]:
# Initialize framework for generation capture
framework_generation = InstrumentationFramework(config_full)

# Register the model
framework_generation.instrument_model(model)
print("✓ Model instrumented for generation capture")

# Prepare the same prompt
inputs = tokenizer(test_prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

# Capture activations during generation
output_path_generation = "generation_activations.stream"

print(f"\n🔍 Capturing activations during GENERATION (Reasoning)...")
print(f"   Output file: {output_path_generation}")
print(f"   Granularity: FULL_TENSOR")
print(f"   Compression: lz4")
print(f"   Max new tokens: 100\n")

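with framework_generation.capture_activations(output_path_generation):
    # Generate tokens (reasoning phase)
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            # Pass the mask explicitly: pad and EOS share an id here, so it cannot be inferred
            attention_mask=inputs["attention_mask"].to(model.device),
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )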

# Decode the generated text
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print("✓ Generation activations captured successfully!")
print(f"  - File created: {os.path.exists(output_path_generation)}")
print(f"  - File size: {os.path.getsize(output_path_generation):,} bytes")
print(f"\n📝 Generated response:\n")
print(f"{generated_text}")

## 7. Analyze Captured Activations

Now let's analyze both activation streams to see the differences between prompt processing and generation phases.

In [None]:
# Analyze prompt activations
print("\n" + "="*60)
print("ANALYSIS: PROMPT PROCESSING ACTIVATIONS")
print("="*60)

analysis_prompt = framework_prompt.analyze_activations(output_path_prompt)

print(f"\n📊 Overall Statistics:")
print(f"  - Stream path: {analysis_prompt['stream_path']}")
print(f"  - Compression: {analysis_prompt['compression']}")
print(f"  - Total packets: {analysis_prompt['packets']:,}")
print(f"  - Total compressed bytes: {analysis_prompt['total_compressed_bytes']:,}")

print(f"\n📋 Per-Layer Breakdown:")
for layer_name, stats in sorted(analysis_prompt['per_layer'].items()):
    if layer_name:  # Skip empty names
        avg_size = stats['bytes'] / stats['count'] if stats['count'] > 0 else 0
        print(f"  {layer_name:30s} → {stats['count']:3d} packets, {stats['bytes']:10,} bytes (avg: {avg_size:,.0f} bytes/packet)")

# Analyze generation activations
print("\n" + "="*60)
print("ANALYSIS: GENERATION (REASONING) ACTIVATIONS")
print("="*60)

analysis_generation = framework_generation.analyze_activations(output_path_generation)

print(f"\n📊 Overall Statistics:")
print(f"  - Stream path: {analysis_generation['stream_path']}")
print(f"  - Compression: {analysis_generation['compression']}")
print(f"  - Total packets: {analysis_generation['packets']:,}")
print(f"  - Total compressed bytes: {analysis_generation['total_compressed_bytes']:,}")

print(f"\n📋 Per-Layer Breakdown (top 10 by packet count):")
sorted_layers = sorted(analysis_generation['per_layer'].items(), 
                      key=lambda x: x[1]['count'], 
                      reverse=True)
for layer_name, stats in sorted_layers[:10]:
    if layer_name:
        avg_size = stats['bytes'] / stats['count'] if stats['count'] > 0 else 0
        print(f"  {layer_name:30s} → {stats['count']:3d} packets, {stats['bytes']:10,} bytes (avg: {avg_size:,.0f} bytes/packet)")
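The per-layer breakdown above sorts by packet count; ranking by total bytes is often just as informative. The helper below is a small sketch that works on any dictionary shaped like the `per_layer` mapping used in this notebook (layer name mapped to `count` and `bytes`). The function name and the sample numbers are made up for illustration.

```python
def top_layers_by_bytes(per_layer: dict, k: int = 3) -> list:
    """Return up to k (name, total_bytes, avg_bytes_per_packet) rows, largest first."""
    rows = []
    for name, stats in per_layer.items():
        # Skip unnamed entries and layers that produced no packets.
        if not name or stats["count"] == 0:
            continue
        rows.append((name, stats["bytes"], stats["bytes"] / stats["count"]))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:k]


# Made-up sample data in the same shape as analysis['per_layer'].
sample = {
    "layers.0.self_attn": {"count": 4, "bytes": 4000},
    "layers.0.mlp": {"count": 4, "bytes": 12000},
    "": {"count": 1, "bytes": 10},
    "layers.1.mlp": {"count": 0, "bytes": 0},
}
for name, total, avg in top_layers_by_bytes(sample):
    print(f"{name:25s} {total:>8,} bytes total, {avg:,.0f} bytes/packet")
```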

## 8. Compare Prompt vs. Generation Phases

Let's create a comparison to understand how the model behaves differently during prompt processing vs. generation.

In [None]:
print("\n" + "="*60)
print("COMPARISON: PROMPT vs GENERATION")
print("="*60)

comparison = {
    "Metric": [
        "Total Packets",
        "Total Compressed Bytes",
        "Unique Layers",
        "Compression Algorithm",
        "Granularity",
    ],
    "Prompt Processing": [
        f"{analysis_prompt['packets']:,}",
        f"{analysis_prompt['total_compressed_bytes']:,}",
        f"{len(analysis_prompt['per_layer'])}",
        analysis_prompt['compression'],
        "ATTENTION_ONLY",
    ],
    "Generation (Reasoning)": [
        f"{analysis_generation['packets']:,}",
        f"{analysis_generation['total_compressed_bytes']:,}",
        f"{len(analysis_generation['per_layer'])}",
        analysis_generation['compression'],
        "FULL_TENSOR",
    ],
}

# Print comparison table
print(f"\n{'Metric':<30} {'Prompt Processing':<30} {'Generation'}")
print("-" * 80)
for i, metric in enumerate(comparison["Metric"]):
    print(f"{metric:<30} {comparison['Prompt Processing'][i]:<30} {comparison['Generation (Reasoning)'][i]}")

# Calculate ratios (skip if the prompt stream is empty to avoid division by zero)
if analysis_prompt['packets'] > 0 and analysis_prompt['total_compressed_bytes'] > 0:
    packet_ratio = analysis_generation['packets'] / analysis_prompt['packets']
    byte_ratio = analysis_generation['total_compressed_bytes'] / analysis_prompt['total_compressed_bytes']
    
    print(f"\n📈 Generation/Prompt Ratios:")
    print(f"  - Packet ratio: {packet_ratio:.2f}x as many packets during generation")
    print(f"  - Data ratio: {byte_ratio:.2f}x as much data during generation")
    print(f"\n💡 Interpretation:")
    print(f"  Generation produces {packet_ratio:.1f}x as many packets because:")
    print(f"  1. Multiple forward passes (one per generated token)")
    print(f"  2. Full tensor capture vs attention-only")
    print(f"  3. The model generates {generated_ids.shape[1] - input_ids.shape[1]} new tokens")
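The packet ratio is easier to interpret with an estimate of how many forward passes a generation run performs. The sketch below assumes key/value caching is enabled, so the first pass processes the whole prompt and each later pass processes a single new token. `generation_passes` is a hypothetical helper, and how forward passes map to packets depends on the framework.

```python
def generation_passes(prompt_len: int, new_tokens: int) -> dict:
    """Count forward passes and token positions for cached autoregressive decoding."""
    if prompt_len < 1 or new_tokens < 1:
        raise ValueError("prompt_len and new_tokens must both be at least 1")
    return {
        "forward_passes": new_tokens,        # one pass per generated token
        "prefill_positions": prompt_len,     # the first pass sees the whole prompt
        "decode_positions": new_tokens - 1,  # later passes see one token each
        "total_positions": prompt_len + new_tokens - 1,
    }


# Example numbers only: a 30-token prompt followed by 100 generated tokens.
for key, value in generation_passes(prompt_len=30, new_tokens=100).items():
    print(f"{key}: {value}")
```

By comparison, the prompt-only demo runs a single forward pass, so with per-pass hooks a packet ratio near the number of generated tokens (times any granularity difference) would be plausible.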

## 9. Demo 3: MLP-Only Capture for Feature Analysis

Let's demonstrate MLP-only capture, which is useful for understanding how the model transforms representations in the feedforward layers.

In [None]:
# Initialize framework for MLP capture
framework_mlp = InstrumentationFramework(config_mlp)
framework_mlp.instrument_model(model)

print("✓ Model instrumented for MLP-only capture")

# Use a different prompt
mlp_prompt = prompts[1]
print(f"\nPrompt: {mlp_prompt}\n")

inputs = tokenizer(mlp_prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

output_path_mlp = "mlp_activations.stream"

print(f"🔍 Capturing MLP activations...")
print(f"   Granularity: MLP_ONLY")
print(f"   Compression: none\n")

with framework_mlp.capture_activations(output_path_mlp):
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            max_new_tokens=50,
            do_sample=True,  # temperature only takes effect when sampling is enabled
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print("✓ MLP activations captured!")
print(f"  - File size: {os.path.getsize(output_path_mlp):,} bytes")
print(f"\n📝 Generated response:\n{generated_text}")

# Analyze MLP activations
analysis_mlp = framework_mlp.analyze_activations(output_path_mlp)

print(f"\n📊 MLP Analysis:")
print(f"  - Total packets: {analysis_mlp['packets']:,}")
print(f"  - Total bytes (uncompressed): {analysis_mlp['total_compressed_bytes']:,}")
print(f"  - Unique MLP layers captured: {len(analysis_mlp['per_layer'])}")

if analysis_mlp['per_layer']:
    print(f"\n📋 MLP Layers:")
    for layer_name in sorted(analysis_mlp['per_layer'].keys())[:5]:
        if layer_name:
            print(f"  - {layer_name}")

## 10. Verification: Compare Compression Algorithms

Let's verify compression effectiveness by comparing file sizes across different algorithms.

In [None]:
# Test all compression algorithms with the same prompt and settings
compression_results = {}
test_prompt_short = "What is 2 + 2? Explain."
inputs_test = tokenizer(test_prompt_short, return_tensors="pt")
input_ids_test = inputs_test["input_ids"].to(model.device)

print("\n🧪 Testing compression algorithms...\n")

for algo in ["none", "lz4", "zstd"]:
    config_test = InstrumentationConfig(
        granularity=HookGranularity.FULL_TENSOR,
        compression_algorithm=algo,
        target_throughput_gbps=2.0,
        max_memory_gb=8,
    )
    
    framework_test = InstrumentationFramework(config_test)
    framework_test.instrument_model(model)
    
    output_path_test = f"compression_test_{algo}.stream"
    
    print(f"Testing {algo:6s} compression...", end=" ")
    
    with framework_test.capture_activations(output_path_test):
        with torch.no_grad():
            _ = model.generate(
                input_ids_test,
                max_new_tokens=30,
                do_sample=False,  # greedy decoding so every algorithm sees the same tokens
                pad_token_id=tokenizer.eos_token_id,
            )
    
    file_size = os.path.getsize(output_path_test)
    compression_results[algo] = file_size
    print(f"✓ {file_size:,} bytes")

# Calculate compression ratios
print("\n" + "="*60)
print("COMPRESSION COMPARISON")
print("="*60)

baseline = compression_results["none"]
print(f"\n{'Algorithm':<10} {'Size (bytes)':<15} {'Compression Ratio':<20} {'Reduction'}")
print("-" * 70)

for algo, size in compression_results.items():
    ratio = baseline / size if size > 0 else 0
    reduction = (1 - size/baseline) * 100 if baseline > 0 else 0
    ratio_str = f"{ratio:.2f}x"  # format first so the "x" stays next to the number
    print(f"{algo:<10} {size:<15,} {ratio_str:<20} {reduction:>5.1f}%")

print(f"\n💡 Best compression: {min(compression_results, key=compression_results.get)}")
print(f"💡 Fastest (no compression): none")
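How well activations compress depends heavily on their content. The sketch below illustrates this with `zlib` from the standard library, used here only as a stand-in because `lz4` and `zstd` need third-party packages. The two buffers are synthetic float32 data, not real activations.

```python
import random
import struct
import zlib


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, level))


rng = random.Random(0)  # fixed seed so repeated runs use the same data
n = 4096

# Dense Gaussian noise: the mantissa bits are close to random, so little is saved.
noisy = struct.pack(f"{n}f", *(rng.gauss(0.0, 1.0) for _ in range(n)))

# Mostly zeros with a regular pattern: highly repetitive, so it compresses well.
sparse = struct.pack(f"{n}f", *(1.0 if i % 8 == 0 else 0.0 for i in range(n)))

print(f"noisy float32 buffer:  {compression_ratio(noisy):.2f}x")
print(f"sparse float32 buffer: {compression_ratio(sparse):.2f}x")
```

Dense floating-point activations tend to behave more like the noisy buffer, so modest ratios in the table above would not be surprising.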

## 11. Summary and Verification

Let's verify that everything worked correctly and summarize what we've learned.

In [None]:
print("\n" + "="*60)
print("VERIFICATION CHECKLIST")
print("="*60)

checks = [
    ("Model loaded successfully", model is not None),
    ("Prompt activations captured", os.path.exists(output_path_prompt)),
    ("Generation activations captured", os.path.exists(output_path_generation)),
    ("MLP activations captured", os.path.exists(output_path_mlp)),
    ("Prompt analysis completed", analysis_prompt['packets'] > 0),
    ("Generation analysis completed", analysis_generation['packets'] > 0),
    ("MLP analysis completed", analysis_mlp['packets'] > 0),
    ("Compression test completed", len(compression_results) == 3),
    # generated_ids and input_ids here come from the MLP demo, the last full generation run
    ("Model generated valid output", generated_ids.shape[1] > input_ids.shape[1]),
]

all_passed = True
for check_name, result in checks:
    status = "✓" if result else "✗"
    print(f"{status} {check_name}")
    if not result:
        all_passed = False

print("\n" + "="*60)
if all_passed:
    print("✓ ALL CHECKS PASSED - Framework is working correctly!")
else:
    print("✗ Some checks failed - please review the output above")
print("="*60)

# Print summary
print("\n" + "="*60)
print("SUMMARY: What You've Learned")
print("="*60)
print("""
✓ How to configure the instrumentation framework with different settings
✓ How to instrument a transformer model for activation capture
✓ How to capture activations separately for:
  - Prompt processing (input phase)
  - Generation/reasoning (output phase)
✓ How to use different granularities:
  - FULL_TENSOR: Complete activation capture
  - ATTENTION_ONLY: Focus on attention mechanisms
  - MLP_ONLY: Focus on feedforward layers
✓ How to compare compression algorithms:
  - none: Fastest, largest files
  - lz4: Fast compression, good ratio
  - zstd: Best compression, slower
✓ How to analyze captured activation streams
✓ How to interpret per-layer statistics
""")

print("="*60)
print("FILES GENERATED")
print("="*60)
print(f"""
1. {output_path_prompt} ({os.path.getsize(output_path_prompt):,} bytes)
   → Attention-only activations from prompt processing

2. {output_path_generation} ({os.path.getsize(output_path_generation):,} bytes)
   → Full tensor activations from reasoning/generation

3. {output_path_mlp} ({os.path.getsize(output_path_mlp):,} bytes)
   → MLP-only activations (uncompressed)

4. compression_test_*.stream (3 files)
   → Compression algorithm comparison
""")

print("\n" + "="*60)
print("NEXT STEPS")
print("="*60)
print("""
Now you can:
1. Use these activation streams for downstream analysis
2. Feed them into interpretability tools (SAE, causal graphs)
3. Correlate activations with model behavior
4. Optimize model performance based on activation patterns
5. Compare activations across different prompts/models

For more information, see the documentation:
- docs/API.md - Complete API reference
- docs/STREAM_FORMAT.md - Stream format specification
- docs/ARCHITECTURE.md - Framework architecture
""")

## 12. Cleanup (Optional)

Uncomment and run this cell to remove generated files.

In [None]:
# import os

# files_to_remove = [
#     output_path_prompt,
#     output_path_generation,
#     output_path_mlp,
#     "compression_test_none.stream",
#     "compression_test_lz4.stream",
#     "compression_test_zstd.stream",
# ]

# for file_path in files_to_remove:
#     if os.path.exists(file_path):
#         os.remove(file_path)
#         print(f"Removed: {file_path}")

# print("\n✓ Cleanup complete!")
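An alternative to listing each file by hand is to remove stream files by pattern. The sketch below uses `pathlib` and demonstrates the idea inside a temporary directory so that nothing real is deleted. `remove_streams` is a hypothetical helper, and pointing it at a working directory would delete every file matching the pattern there.

```python
import tempfile
from pathlib import Path


def remove_streams(directory: str, pattern: str = "*.stream") -> list:
    """Delete files matching pattern in directory; return the removed file names."""
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.stream", "b.stream", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"placeholder")
    removed_names = remove_streams(tmp)
    remaining_names = sorted(p.name for p in Path(tmp).iterdir())

print("Removed:", removed_names)
print("Kept:", remaining_names)
```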