# Concept Activation Vectors (CAVs) for "Thinking, Fast and Slow"

This notebook demonstrates how to use Concept Activation Vectors (CAVs) to steer a language model toward behaviors described in Daniel Kahneman's "Thinking, Fast and Slow". We focus on System 1 (fast thinking) versus System 2 (slow thinking) and on several of the cognitive biases and heuristics described in the book.

## Overview

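Concept Activation Vectors (CAVs) are directions in a model's activation space that separate examples of a concept from counterexamples. Here we use them to steer language model behavior by adding those directions to the model's internal activations. In this notebook, we'll: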

1. Set up the environment and load a language model
2. Create a CAV implementation for manipulating model behavior
3. Train CAVs for System 1 and System 2 thinking patterns
4. Train CAVs for specific cognitive biases from the book
5. Test the CAVs on examples from "Thinking, Fast and Slow"
6. Evaluate the results
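
To make the idea concrete, here is a minimal sketch of how a CAV can be derived. It uses synthetic data in place of real model activations, so `hidden_dim`, `true_direction`, and the sample counts are illustrative assumptions. A linear classifier is trained to separate "concept" from "non-concept" vectors, and its normalized weight vector is taken as the concept direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 16  # toy size; real models use hundreds or thousands of dimensions

# Hypothetical "true" concept direction, used only to generate toy data
true_direction = np.zeros(hidden_dim)
true_direction[0] = 1.0

# Synthetic pooled activations: positives are shifted along the direction,
# negatives are shifted the opposite way
positive = rng.normal(size=(50, hidden_dim)) + 3.0 * true_direction
negative = rng.normal(size=(50, hidden_dim)) - 3.0 * true_direction

X = np.vstack([positive, negative])
y = np.array([1] * len(positive) + [0] * len(negative))

# Train a linear classifier to separate the two groups
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)

# The CAV is the unit-normalized weight vector (normal to the decision boundary)
cav = classifier.coef_[0] / np.linalg.norm(classifier.coef_[0])

# Compare the recovered direction with the one planted in the data
cosine = float(cav @ true_direction)
print(f"Cosine similarity with the planted direction: {cosine:.3f}")
```

On real activations there is no planted direction to compare against, so classifier accuracy on held-out examples is the more practical sanity check.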

Let's get started!

## 1. Setup and Dependencies

First, let's install the necessary dependencies and configure PyTorch to run on the CPU, which sidesteps MPS issues on Apple Silicon.

In [1]:
# Apple Silicon (M1/M2/M3) compatibility settings
# IMPORTANT: Run this cell first, before importing any other libraries
import os

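# Keep PyTorch off MPS (Metal Performance Shaders) as far as environment variables allow.
# PYTORCH_ENABLE_MPS_FALLBACK=1 lets operations that MPS does not support fall back to the CPU.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
# A ratio of 0.0 removes the upper limit on MPS memory allocations.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
# NOTE: PYTORCH_NO_MPS does not appear to be a documented PyTorch variable. The explicit
# CPU placement in the following cells is what actually keeps the model off MPS.
os.environ["PYTORCH_NO_MPS"] = "1"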

# Install required packages if needed
# !pip install torch transformers datasets numpy matplotlib scikit-learn tqdm

In [2]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from tqdm.notebook import tqdm
import json
import random
from typing import List, Dict, Tuple, Optional, Union, Any

# Force CPU for all PyTorch operations
# torch.set_default_device is the public API (available in PyTorch 2.0 and later)
if hasattr(torch, "set_default_device"):
    torch.set_default_device("cpu")
device = "cpu"

# Double-check we're not using MPS
if hasattr(torch.backends, "mps"):
    print(f"MPS available: {torch.backends.mps.is_available()}")
    if torch.backends.mps.is_available():
        print("WARNING: MPS is still available despite disabling it. Forcing CPU usage.")

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

print(f"Using device: {device} (forced for Apple Silicon compatibility)")


## 2. Load a Language Model

We'll use a small open-source model whose internal activations we can read and modify. This demonstration uses OPT-1.3B, but you can substitute GPT-2 or another small causal language model.

In [4]:
# Define model name - using a smaller model that's more accessible
model_name = "facebook/opt-1.3b"  # Alternative options: "gpt2", "EleutherAI/pythia-1.4b", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load model and tokenizer
print(f"Loading {model_name}...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # Use float32 for CPU
    device_map=None  # Don't use device_map with CPU
)

# Explicitly move model to CPU
model = model.to("cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Verify model is on CPU
print(f"Model device: {next(model.parameters()).device}")
assert str(next(model.parameters()).device) == "cpu", "Model must be on CPU"
print("Model loaded successfully!")

Loading facebook/opt-1.3b...
Model device: cpu
Model loaded successfully!


## 3. Concept Activation Vectors Implementation

Now, let's implement the core CAV functionality. This includes:
- Registering hooks to access model activations
- Collecting activations for contrastive examples
- Training classifiers to identify concept directions
- Applying steering during generation
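
Before the full implementation, the sketch below illustrates the two hook patterns it relies on, using a toy `nn.Sequential` network as a stand-in for transformer layers. The network, `hidden_dim`, `concept_vector`, and `steering_strength` values are assumptions chosen for illustration. A forward hook that returns nothing only records the layer output, while a forward hook that returns a tensor replaces it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 8
# Toy stand-in for a stack of transformer layers (illustration only)
toy_model = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, hidden_dim),
)
toy_model.eval()

captured = {}

def capture_hook(module, inputs, output):
    # Returning None leaves the output unchanged; we only record a copy
    captured["layer0"] = output.detach().clone()

# Input shaped (batch, sequence, hidden), like transformer hidden states
x = torch.randn(1, 5, hidden_dim)

handle = toy_model[0].register_forward_hook(capture_hook)
with torch.no_grad():
    baseline = toy_model(x)
handle.remove()

# Mean-pool over the sequence dimension to get one fixed-size vector per prompt
pooled = captured["layer0"].mean(dim=1).squeeze(0)

# Hypothetical unit-norm concept direction
concept_vector = torch.zeros(hidden_dim)
concept_vector[0] = 1.0
steering_strength = 2.0

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output
    return output + steering_strength * concept_vector.reshape(1, 1, -1)

handle = toy_model[0].register_forward_hook(steering_hook)
with torch.no_grad():
    steered = toy_model(x)
handle.remove()

# With the hook removed, the model behaves as before
with torch.no_grad():
    restored = toy_model(x)

print(f"Pooled activation shape: {tuple(pooled.shape)}")
print(f"Steering changed the output: {not torch.allclose(baseline, steered)}")
print(f"Output restored after removing the hook: {torch.allclose(baseline, restored)}")
```

Removing every hook handle after use matters: a leftover steering hook would silently alter all later forward passes.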

In [5]:
class ConceptActivationVectors:
    """
    Implementation of Concept Activation Vectors (CAVs) for steering LLM behavior.
    """
    
    def __init__(self, model, tokenizer, device="cpu"):
        """
        Initialize the CAV implementation with a specified model.
        
        Args:
            model: Hugging Face model
            tokenizer: Hugging Face tokenizer
            device: Accepted for interface consistency but ignored; everything runs on the CPU
        """
        self.model = model
        self.tokenizer = tokenizer
        self.device = "cpu"  # Force CPU for Apple Silicon compatibility
        
        # Dictionary to store concept vectors for different concepts
        self.concept_vectors = {}
        
        # Hooks for accessing activations
        self.hooks = []
        self.activation_storage = {}
        
        # Verify model is on CPU
        assert str(next(model.parameters()).device) == "cpu", "Model must be on CPU"
    
    def _register_hooks(self, layers):
        """
        Register hooks to capture activations from specified layers.
        
        Args:
            layers: List of layer indices to capture activations from
        """
        # Remove any existing hooks
        self._remove_hooks()
        
        # Register new hooks
        for layer_idx in layers:
            try:
                # Access transformer layers (implementation depends on model architecture)
                if "llama" in str(type(self.model)).lower():
                    layer = self.model.model.layers[layer_idx]
                elif "mistral" in str(type(self.model)).lower():
                    layer = self.model.model.layers[layer_idx]
                elif "opt" in str(type(self.model)).lower():
                    layer = self.model.model.decoder.layers[layer_idx]
                elif "gpt2" in str(type(self.model)).lower():
                    layer = self.model.transformer.h[layer_idx]
                else:
                    # Generic approach for other models
                    layer = list(self.model.modules())[layer_idx]
                
                # Define hook function to store activations
                def get_hook_fn(layer_id):
                    def hook_fn(module, input, output):
                        # Handle case where output is a tuple (common in transformer models)
                        try:
                            if isinstance(output, tuple):
                                # Usually the first element contains the hidden states
                                self.activation_storage[layer_id] = output[0].detach().to("cpu")
                            else:
                                # Original code for when output is a tensor
                                self.activation_storage[layer_id] = output.detach().to("cpu")
                        except Exception as e:
                            print(f"Error in hook for layer {layer_id}: {e}")
                            print(f"Output type: {type(output)}")
                            if isinstance(output, tuple):
                                print(f"Tuple length: {len(output)}")
                    return hook_fn
                
                # Register the hook
                hook = layer.register_forward_hook(get_hook_fn(layer_idx))
                self.hooks.append(hook)
                print(f"Registered hook for layer {layer_idx}")
            except Exception as e:
                print(f"Error registering hook for layer {layer_idx}: {e}")
    
    def _remove_hooks(self):
        """Remove all registered hooks."""
        for hook in self.hooks:
            hook.remove()
        self.hooks = []
        self.activation_storage = {}
        print("Removed all hooks")
    
    def collect_activations(self, prompts, layers):
        """
        Collect activations from the model for a list of prompts.
        
        Args:
            prompts: List of text prompts
            layers: List of layer indices to collect activations from
            
        Returns:
            Dictionary mapping layer indices to lists of activation tensors
        """
        self._register_hooks(layers)
        activations = {layer: [] for layer in layers}
        
        for prompt in tqdm(prompts, desc="Collecting activations"):
            # Ensure inputs are on CPU
            inputs = self.tokenizer(prompt, return_tensors="pt")
            inputs = {k: v.to("cpu") for k, v in inputs.items()}
            
            # Run the model in evaluation mode
            self.model.eval()
            with torch.no_grad():
                try:
                    self.model(**inputs)
                    
                    # Store activations
                    for layer in layers:
                        if layer in self.activation_storage:
                            # Store a copy of the activation to avoid reference issues
                            activations[layer].append(self.activation_storage[layer].clone())
                        else:
                            print(f"Warning: No activation stored for layer {layer}")
                except Exception as e:
                    print(f"Error running model for prompt '{prompt[:30]}...': {e}")
        
        self._remove_hooks()
        return activations
    
    def train_concept_vector(self, concept_name, positive_prompts, negative_prompts, layers):
        """
        Train a concept activation vector using contrastive examples.
        
        Args:
            concept_name: Name of the concept (e.g., "system1_thinking", "anchoring_bias")
            positive_prompts: Prompts that exhibit the concept
            negative_prompts: Prompts that don't exhibit the concept
            layers: List of layer indices to train CAVs for
            
        Returns:
            Dictionary mapping layer indices to concept vectors
        """
        print(f"Training concept vector for '{concept_name}'...")
        
        # Collect activations for positive and negative examples
        positive_activations = self.collect_activations(positive_prompts, layers)
        negative_activations = self.collect_activations(negative_prompts, layers)
        
        # Train a classifier for each layer
        concept_vectors = {}
        
        for layer in layers:
            if layer not in positive_activations or not positive_activations[layer] or \
               layer not in negative_activations or not negative_activations[layer]:
                print(f"Skipping layer {layer} due to missing activations")
                continue
                
            # Prepare training data
            try:
                # Debug activation shapes
                print(f"Layer {layer} activation shapes:")
                for i, act in enumerate(positive_activations[layer]):
                    print(f"  Positive example {i}: {act.shape}")
                for i, act in enumerate(negative_activations[layer]):
                    print(f"  Negative example {i}: {act.shape}")
                
                # Use mean pooling to get fixed-size representations
                # This avoids dimension mismatch issues
                X_positive = torch.stack([act.mean(dim=1).squeeze() for act in positive_activations[layer]]).cpu().numpy()
                X_negative = torch.stack([act.mean(dim=1).squeeze() for act in negative_activations[layer]]).cpu().numpy()
                
                print(f"After pooling - X_positive shape: {X_positive.shape}, X_negative shape: {X_negative.shape}")
                
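                X = np.vstack([X_positive, X_negative])
                # Build labels from the collected activations, not the prompt lists,
                # so X and y stay aligned even if some prompts failed during collection
                y = np.array([1] * len(X_positive) + [0] * len(X_negative))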
                
                # Train logistic regression classifier
                classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
                classifier.fit(X, y)
                
                # Extract the concept vector (normal to the decision boundary)
                concept_vector = classifier.coef_[0]
                
                # Normalize the vector
                concept_vector = concept_vector / np.linalg.norm(concept_vector)
                
                # Store the concept vector
                concept_vectors[layer] = torch.tensor(concept_vector, dtype=torch.float32).to("cpu")
                
                # Accuracy is measured on the training data itself, so it is an optimistic estimate
                print(f"  Layer {layer}: Training accuracy = {classifier.score(X, y):.4f}")
            except Exception as e:
                print(f"Error training classifier for layer {layer}: {e}")
                import traceback
                traceback.print_exc()
        
        # Store the concept vectors
        self.concept_vectors[concept_name] = concept_vectors
        
        return concept_vectors
    
    def apply_steering(self, prompt, concept_name, layers, steering_strength=1.0, max_tokens=100):
        """
        Apply concept steering during generation.
        
        Args:
            prompt: Input prompt
            concept_name: Name of the concept to steer towards/away from
            layers: List of layer indices to apply steering to
            steering_strength: Strength of steering (positive for enhancing, negative for suppressing)
            max_tokens: Maximum number of tokens to generate
            
        Returns:
            Generated text with concept steering applied
        """
        if concept_name not in self.concept_vectors:
            raise ValueError(f"Concept '{concept_name}' not found. Train it first.")
        
        # Ensure we have concept vectors for all specified layers
        valid_layers = []
        for layer in layers:
            if layer in self.concept_vectors[concept_name]:
                valid_layers.append(layer)
            else:
                print(f"Warning: No concept vector for layer {layer} and concept '{concept_name}'. Skipping.")
        
        if not valid_layers:
            raise ValueError(f"No valid layers found for concept '{concept_name}'")
        
        # Tokenize the prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cpu")
        
        # Define a forward hook that applies steering
        def steering_hook(layer_idx, module, input, output):
            try:
                # Get the concept vector for this layer
                concept_vector = self.concept_vectors[concept_name][layer_idx].to("cpu")
                
                # Layer outputs are often tuples whose first element holds the hidden states
                is_tuple = isinstance(output, tuple)
                hidden_states = output[0] if is_tuple else output
                
                # The concept vector should already match the hidden size. If it does not,
                # truncate or zero-pad it as a last resort and report the mismatch.
                hidden_dim = hidden_states.size(-1)
                if concept_vector.size(0) != hidden_dim:
                    print(f"Resizing concept vector from {concept_vector.size(0)} to {hidden_dim}")
                    if concept_vector.size(0) > hidden_dim:
                        concept_vector = concept_vector[:hidden_dim]
                    else:
                        # Pad with zeros if concept vector is smaller
                        padding = torch.zeros(hidden_dim - concept_vector.size(0), device="cpu")
                        concept_vector = torch.cat([concept_vector, padding])
                
                # Reshape to (1, 1, hidden_dim) so the vector broadcasts over batch and sequence
                reshaped_vector = concept_vector.reshape(1, 1, -1)
                
                # Apply steering to every token position
                modified_output = hidden_states + steering_strength * reshaped_vector
                
                # Preserve the original output structure
                if is_tuple:
                    return (modified_output,) + output[1:]
                return modified_output
            except Exception as e:
                print(f"Error in steering hook for layer {layer_idx}: {e}")
                import traceback
                traceback.print_exc()
                return output  # Return original output on error
        
        # Register steering hooks
        hooks = []
        for layer_idx in valid_layers:
            try:
                if "llama" in str(type(self.model)).lower():
                    layer = self.model.model.layers[layer_idx]
                elif "mistral" in str(type(self.model)).lower():
                    layer = self.model.model.layers[layer_idx]
                elif "opt" in str(type(self.model)).lower():
                    layer = self.model.model.decoder.layers[layer_idx]
                elif "gpt2" in str(type(self.model)).lower():
                    layer = self.model.transformer.h[layer_idx]
                else:
                    layer = list(self.model.modules())[layer_idx]
                
                hook = layer.register_forward_hook(
                    lambda module, input, output, layer_idx=layer_idx: 
                    steering_hook(layer_idx, module, input, output)
                )
                hooks.append(hook)
                print(f"Registered steering hook for layer {layer_idx}")
            except Exception as e:
                print(f"Error registering steering hook for layer {layer_idx}: {e}")
        
        # Generate text with steering applied
        self.model.eval()
        with torch.no_grad():
            try:
                outputs = self.model.generate(
                    **inputs,  # Pass input_ids and attention_mask together
                    max_new_tokens=max_tokens,
                    do_sample=True,
                    temperature=0.7,
                    top_p=0.9
                )
            except Exception as e:
                print(f"Error during generation: {e}")
                import traceback
                traceback.print_exc()
                # Remove hooks before re-raising
                for hook in hooks:
                    hook.remove()
                raise
        
        # Remove hooks
        for hook in hooks:
            hook.remove()
        
        # Decode the generated text
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        return generated_text
    
    def save_concept_vectors(self, filepath):
        """Save trained concept vectors to a file."""
        # Convert tensors to numpy arrays for saving
        serializable_vectors = {}
        for concept, layers in self.concept_vectors.items():
            serializable_vectors[concept] = {
                str(layer): vector.cpu().numpy().tolist() 
                for layer, vector in layers.items()
            }
        
        with open(filepath, 'w') as f:
            json.dump(serializable_vectors, f)
        
        print(f"Concept vectors saved to {filepath}")
    
    def load_concept_vectors(self, filepath):
        """Load trained concept vectors from a file."""
        with open(filepath, 'r') as f:
            serializable_vectors = json.load(f)
        
        # Convert back to tensors
        self.concept_vectors = {}
        for concept, layers in serializable_vectors.items():
            self.concept_vectors[concept] = {
                int(layer): torch.tensor(vector, dtype=torch.float32).to("cpu")
                for layer, vector in layers.items()
            }
        
        print(f"Loaded concept vectors for: {list(self.concept_vectors.keys())}")
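
The save and load methods above depend on two conversions: JSON object keys must be strings, and arrays must become plain lists. The sketch below shows the same round trip in isolation. It uses NumPy arrays as a stand-in for the PyTorch tensors, and the vectors and file location are made up for illustration.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical trained vectors: concept name -> {layer index -> unit vector}
concept_vectors = {
    "system1_thinking": {
        4: np.array([1.0, 0.0, 0.0]),
        9: np.array([0.0, 0.6, 0.8]),
    }
}

# JSON object keys must be strings, and arrays must become plain lists
serializable = {
    concept: {str(layer): vec.tolist() for layer, vec in layers.items()}
    for concept, layers in concept_vectors.items()
}

# Write to a temporary directory so the sketch leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "concept_vectors.json")
with open(path, "w") as f:
    json.dump(serializable, f)

with open(path) as f:
    loaded_raw = json.load(f)

# Restore integer layer keys and array values
loaded = {
    concept: {int(layer): np.array(vec) for layer, vec in layers.items()}
    for concept, layers in loaded_raw.items()
}

print(f"Layers after reload: {sorted(loaded['system1_thinking'].keys())}")
```

Skipping the `int(layer)` conversion on load would leave string keys, and a later lookup by integer layer index would fail to find the vector.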

## 4. "Thinking, Fast and Slow" Implementation

Now, let's create a specialized class that applies CAVs to concepts from "Thinking, Fast and Slow".
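
The class below picks steering layers at fixed fractions of the model's depth. As a quick sketch of what those fractions resolve to, the hypothetical helper `pick_layers` reproduces the `int(num_layers * fraction)` computation. The layer counts (12 for GPT-2 small, 24 for OPT-1.3B) are assumptions here; the class reads the real count from the loaded model.

```python
def pick_layers(num_layers, fractions):
    """Map depth fractions to layer indices by truncating toward zero."""
    return [int(num_layers * fraction) for fraction in fractions]

# Assumed depths, for illustration only
assumed_depths = {"gpt2": 12, "facebook/opt-1.3b": 24}

for name, num_layers in assumed_depths.items():
    system1_layers = pick_layers(num_layers, (0.2, 0.4))  # earlier layers
    system2_layers = pick_layers(num_layers, (0.6, 0.8))  # later layers
    bias_layers = pick_layers(num_layers, (0.4, 0.6))     # middle layers
    print(f"{name}: System 1 {system1_layers}, System 2 {system2_layers}, bias {bias_layers}")
```

Note that the bias layers share one index with each of the other two groups, because the 0.4 and 0.6 fractions are reused.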

In [6]:
class ThinkingFastSlowCAV:
    """
    Implementation of Concept Activation Vectors specifically for replicating
    theories from "Thinking, Fast and Slow" by Daniel Kahneman.
    """
    
    def __init__(self, model, tokenizer, device="cpu"):
        """
        Initialize the implementation.
        
        Args:
            model: Hugging Face model
            tokenizer: Hugging Face tokenizer
            device: Accepted for interface consistency but ignored; everything runs on the CPU
        """
        # Force CPU for Apple Silicon compatibility
        self.device = "cpu"
        self.cav = ConceptActivationVectors(model, tokenizer, self.device)
        
        # Create data directory if it doesn't exist
        os.makedirs("thinking_fast_slow_data", exist_ok=True)
        
        # Determine appropriate layers based on model architecture
        model_type = str(type(model)).lower()
        
        # Adjust layer indices based on model architecture
        if "gpt2" in model_type:
            # GPT-2 has 12 layers for small, 24 for medium, etc.
            num_layers = len(model.transformer.h)
            self.system1_layers = [int(num_layers * 0.2), int(num_layers * 0.4)]  # Earlier layers for System 1 thinking
            self.system2_layers = [int(num_layers * 0.6), int(num_layers * 0.8)]  # Later layers for System 2 thinking
            self.bias_layers = [int(num_layers * 0.4), int(num_layers * 0.6)]     # Middle layers for cognitive biases
        elif "opt" in model_type:
            # OPT models have varying numbers of layers
            num_layers = len(model.model.decoder.layers)
            self.system1_layers = [int(num_layers * 0.2), int(num_layers * 0.4)]  # Earlier layers for System 1 thinking
            self.system2_layers = [int(num_layers * 0.6), int(num_layers * 0.8)]  # Later layers for System 2 thinking
            self.bias_layers = [int(num_layers * 0.4), int(num_layers * 0.6)]     # Middle layers for cognitive biases
        else:
            # Default layer configuration for other models
            self.system1_layers = [2, 4]  # Earlier layers for System 1 thinking
            self.system2_layers = [6, 8]  # Later layers for System 2 thinking
            self.bias_layers = [4, 6]     # Middle layers for cognitive biases
            
        print(f"Model architecture: {model_type}")
        print(f"System 1 layers: {self.system1_layers}")
        print(f"System 2 layers: {self.system2_layers}")
        print(f"Bias layers: {self.bias_layers}")
        
        # Initialize examples for different concepts
        self.examples = self._initialize_examples()
    
    def _initialize_examples(self):
        """
        Initialize examples for different concepts from "Thinking, Fast and Slow".
        
        Returns:
            Dictionary mapping concept names to (positive_examples, negative_examples)
        """
        examples = {}
        
        # System 1 vs System 2 thinking
        examples["system1_thinking"] = (
            [
                "What is your immediate reaction to this image?",
                "How do you feel about this situation?",
                "What's your gut feeling about this person?",
                "What's the first thing that comes to mind?",
                "Make a quick decision about which option is better.",
                "Is this person trustworthy based on their appearance?",
                "Do you like this painting?",
                "What's your immediate impression of this restaurant?",
                "Does this feel right to you?",
                "What's your instinctive response to this offer?"
            ],
            [
                "Analyze the logical structure of this argument.",
                "Calculate the expected value of this investment.",
                "What are the statistical probabilities in this scenario?",
                "Provide a step-by-step analysis of this problem.",
                "Consider all possible outcomes before making a decision.",
                "What is the mathematical formula that describes this relationship?",
                "Evaluate the evidence supporting this claim.",
                "What are the logical fallacies in this reasoning?",
                "Calculate the compound interest over 10 years.",
                "What is the most rational approach to this problem?"
            ]
        )
        
        examples["system2_thinking"] = (
            examples["system1_thinking"][1],  # Positive examples for System 2 are negative for System 1
            examples["system1_thinking"][0]   # Negative examples for System 2 are positive for System 1
        )
        
        # Cognitive biases and heuristics
        examples["priming"] = (
            [
                "After discussing food, complete the word SO_P.",
                "After talking about cleanliness, complete the word SO_P.",
                "After seeing elderly people, describe how quickly you walk.",
                "After being asked to smile, rate how funny this joke is.",
                "After reading about money, decide how to help someone.",
                "After seeing images of libraries, how quietly do you speak?",
                "After discussing speed, estimate how fast this car is moving.",
                "After watching a sad movie, describe your current mood.",
                "After reading about luxury, choose between these products.",
                "After seeing examples of creativity, solve this problem."
            ],
            [
                "Complete the word SO_P without any context.",
                "Describe how quickly you walk without any prior discussion.",
                "Rate how funny this joke is objectively.",
                "Decide how to help someone based solely on their needs.",
                "Estimate how fast this car is moving based on physics.",
                "Describe your mood based on internal reflection only.",
                "Choose between these products based on their specifications.",
                "Solve this problem using logical reasoning.",
                "Make a decision based purely on the facts presented.",
                "Evaluate this situation without any external influences."
            ]
        )
        
        examples["cognitive_ease"] = (
            [
                "This statement is easy to read and familiar, so it must be true.",
                "I've heard this information many times, so it's probably accurate.",
                "This concept is easy to understand, so it must be correct.",
                "This explanation flows smoothly, so it's likely valid.",
                "This idea feels familiar, so I believe it.",
                "This rhyming slogan seems more accurate than the non-rhyming one.",
                "This clearly printed text seems more truthful than the blurry one.",
                "I've seen this brand many times, so it must be good quality.",
                "This information comes from a source I like, so it's credible.",
                "This concept fits with my existing beliefs, so it's probably true."
            ],
            [
                "Let me evaluate this statement based on evidence, not readability.",
                "Frequency of exposure doesn't determine accuracy of information.",
                "I should judge this concept by its logical merit, not ease of understanding.",
                "The validity of an explanation isn't determined by how smoothly it flows.",
                "Familiarity doesn't imply truth; I need to verify this idea.",
                "Rhyming doesn't make a slogan more accurate than a non-rhyming one.",
                "Print clarity doesn't affect the truthfulness of text.",
                "Brand recognition doesn't guarantee quality; I need to assess objectively.",
                "I should evaluate information based on evidence, not how much I like the source.",
                "Compatibility with existing beliefs doesn't determine truth; I need to examine the facts."
            ]
        )
        
        examples["anchoring_bias"] = (
            [
                "The initial price was $1000. What do you think is a fair price?",
                "The suggested donation amount is $50. How much would you like to donate?",
                "The average score on this test is 85. What score do you expect to get?",
                "This house was previously listed at $500,000. What would you offer?",
                "The recommended daily steps are 10,000. How many steps do you think you should take?",
                "The last bid was $200. What's your bid for this item?",
                "Most people spend 2 hours on this task. How long do you think it will take you?",
                "The speed limit here is 65 mph. How fast were you driving?",
                "The standard tip is 20%. How much would you like to tip?",
                "The CEO earns $5 million annually. What's a fair salary for the VP?"
            ],
            [
                "What do you think is a fair price for this item?",
                "How much would you like to donate to this cause?",
                "What score do you expect to get on this test?",
                "What would you offer for this house?",
                "How many steps do you think you should take daily?",
                "What's your bid for this item?",
                "How long do you think this task will take you?",
                "How fast were you driving?",
                "How much would you like to tip?",
                "What's a fair salary for the VP?"
            ]
        )
        
        examples["framing_effect"] = (
            [
                "The treatment has a 70% success rate. Would you recommend it?",
                "This investment has a 60% chance of making a profit. Is it worth it?",
                "The program will save 200 out of 600 lives. Do you support it?",
                "This policy will create jobs for 5% of the unemployed. Should we implement it?",
                "The product has satisfied 90% of customers. Would you buy it?",
                "This surgery has an 80% survival rate. Would you undergo it?",
                "This diet plan helps 70% of people lose weight. Would you try it?",
                "This security system prevents 75% of break-ins. Is it effective?",
                "This vaccine protects 95% of recipients. Would you get vaccinated?",
                "This educational program improves test scores for 65% of students. Is it valuable?"
            ],
            [
                "The treatment has a 30% failure rate. Would you recommend it?",
                "This investment has a 40% chance of losing money. Is it worth it?",
                "The program will allow 400 out of 600 people to die. Do you support it?",
                "This policy will leave 95% of the unemployed without jobs. Should we implement it?",
                "The product has dissatisfied 10% of customers. Would you buy it?",
                "This surgery has a 20% mortality rate. Would you undergo it?",
                "This diet plan fails to help 30% of people lose weight. Would you try it?",
                "This security system fails to prevent 25% of break-ins. Is it effective?",
                "This vaccine leaves 5% of recipients unprotected. Would you get vaccinated?",
                "This educational program doesn't improve test scores for 35% of students. Is it valuable?"
            ]
        )
        
        examples["availability_bias"] = (
            [
                "After hearing about a plane crash, how safe do you feel flying?",
                "After reading about shark attacks, how dangerous do you think swimming in the ocean is?",
                "After seeing news about lottery winners, how likely do you think winning the lottery is?",
                "After hearing about a terrorist attack, how concerned are you about terrorism?",
                "After reading about a rare disease, how worried are you about contracting it?",
                "After watching a documentary about serial killers, how safe do you feel walking alone?",
                "After hearing about a friend's divorce, how stable do you think marriages are?",
                "After reading about a stock market crash, how risky do you think investing is?",
                "After seeing news about car accidents, how dangerous do you think driving is?",
                "After hearing about a home invasion, how concerned are you about your home security?"
            ],
            [
                "Based on statistics, how safe is flying compared to other forms of transportation?",
                "Statistically, how dangerous is swimming in the ocean?",
                "What are the actual odds of winning the lottery?",
                "Based on data, how likely are you to be affected by terrorism?",
                "What is the statistical prevalence of this disease in the population?",
                "What is the statistical likelihood of being a victim of violent crime?",
                "What percentage of marriages end in divorce according to recent statistics?",
                "What is the historical long-term performance of the stock market?",
                "What is the statistical risk of being in a car accident per mile driven?",
                "What is the actual rate of home invasions in your neighborhood?"
            ]
        )
        
        examples["loss_aversion"] = (
            [
                "Would you accept a bet with a 50% chance to lose $100 and a 50% chance to win $150?",
                "Would you sell this item that has sentimental value for twice what you paid for it?",
                "Would you risk losing your current job for a 60% chance at a better one?",
                "Would you give up your current phone for a 70% chance at a better model?",
                "Would you trade your current car plus cash for a newer model?",
                "Would you sell your concert tickets for 50% more than you paid?",
                "Would you risk your $1000 investment for a 60% chance to earn $2000?",
                "Would you give up your vacation plans for a refund plus 20%?",
                "Would you exchange your current laptop for a different model?",
                "Would you cancel your subscription for a 40% refund?"
            ],
            [
                "Would you accept a bet with a 50% chance to win $150 and a 50% chance to lose $100?",
                "Would you buy this item with sentimental value for twice its market price?",
                "Would you take a new job with a 60% chance of being better than your current one?",
                "Would you buy a new phone with a 70% chance of being better than your current one?",
                "Would you pay cash plus your current car for a newer model?",
                "Would you buy concert tickets for 50% more than their face value?",
                "Would you invest $1000 for a 60% chance to earn $2000?",
                "Would you pay 20% extra to maintain your vacation plans?",
                "Would you buy a different laptop model to replace your current one?",
                "Would you pay 40% more to continue your subscription?"
            ]
        )
        
        return examples
    
    def train_concept(self, concept_name, num_examples=3):
        """
        Train a CAV for a specific concept from "Thinking Fast and Slow".
        
        Args:
            concept_name: Name of the concept to train
            num_examples: Number of examples to use for training
            
        Returns:
            Trained concept vectors
        """
        if concept_name not in self.examples:
            raise ValueError(f"Unknown concept: {concept_name}. Available concepts: {list(self.examples.keys())}")
        
        positive_examples, negative_examples = self.examples[concept_name]
        
        # Use appropriate layers based on the concept
        if concept_name == "system1_thinking":
            layers = self.system1_layers
        elif concept_name == "system2_thinking":
            layers = self.system2_layers
        else:
            layers = self.bias_layers
        
        # Ensure we have enough examples
        if len(positive_examples) < num_examples or len(negative_examples) < num_examples:
            print(f"Warning: Not enough examples for {concept_name}. Using all available examples.")
            num_examples = min(len(positive_examples), len(negative_examples))
        
        # Select random subset if we have more examples than needed
        if len(positive_examples) > num_examples:
            positive_subset = random.sample(positive_examples, num_examples)
        else:
            positive_subset = positive_examples
            
        if len(negative_examples) > num_examples:
            negative_subset = random.sample(negative_examples, num_examples)
        else:
            negative_subset = negative_examples
        
        # Train the concept vector
        return self.cav.train_concept_vector(concept_name, positive_subset, negative_subset, layers)
    
    def train_all_concepts(self, num_examples=3):
        """
        Train CAVs for all concepts from "Thinking Fast and Slow".
        
        Args:
            num_examples: Number of examples to use for each concept
        """
        for concept in self.examples.keys():
            print(f"\nTraining concept: {concept}")
            self.train_concept(concept, num_examples)
    
    def apply_system_thinking(self, prompt, system=1, steering_strength=1.0, max_tokens=100):
        """
        Apply System 1 or System 2 thinking to generation.
        
        Args:
            prompt: Input prompt
            system: Which system to apply (1 or 2)
            steering_strength: Strength of steering
            max_tokens: Maximum number of tokens to generate
            
        Returns:
            Generated text with System 1 or 2 thinking applied
        """
        if system == 1:
            concept = "system1_thinking"
            layers = self.system1_layers
        elif system == 2:
            concept = "system2_thinking"
            layers = self.system2_layers
        else:
            raise ValueError("System must be 1 or 2")
        
        return self.cav.apply_steering(prompt, concept, layers, steering_strength, max_tokens)
    
    def apply_cognitive_bias(self, prompt, bias, steering_strength=1.0, max_tokens=100):
        """
        Apply a specific cognitive bias to generation.
        
        Args:
            prompt: Input prompt
            bias: Which bias to apply (e.g., "anchoring_bias", "framing_effect")
            steering_strength: Strength of steering
            max_tokens: Maximum number of tokens to generate
            
        Returns:
            Generated text with the specified bias applied
        """
        if bias not in self.examples:
            raise ValueError(f"Unknown bias: {bias}. Available biases: {list(self.examples.keys())}")
        
        return self.cav.apply_steering(prompt, bias, self.bias_layers, steering_strength, max_tokens)
    
    def save_concepts(self, filepath="thinking_fast_slow_concepts.json"):
        """Save all trained concept vectors."""
        self.cav.save_concept_vectors(filepath)
    
    def load_concepts(self, filepath="thinking_fast_slow_concepts.json"):
        """Load all trained concept vectors."""
        if not os.path.exists(filepath):
            print(f"No saved concepts found at {filepath}. Train concepts first.")
            return False
        
        self.cav.load_concept_vectors(filepath)
        return True

## 5. Initialize and Train CAVs

Now, let's initialize our Thinking Fast and Slow CAV implementation and train the concept vectors.

In [7]:
# Initialize the implementation
tfs_cav = ThinkingFastSlowCAV(model, tokenizer, device="cpu")

Model architecture: <class 'transformers.models.opt.modeling_opt.optforcausallm'>
System 1 layers: [4, 9]
System 2 layers: [14, 19]
Bias layers: [9, 14]


In [8]:
# Train System 1 and System 2 thinking CAVs
# Start with a smaller number of examples for testing
tfs_cav.train_concept("system1_thinking", num_examples=2)
tfs_cav.train_concept("system2_thinking", num_examples=2)

Training concept vector for 'system1_thinking'...
Removed all hooks
Registered hook for layer 4
Registered hook for layer 9


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Removed all hooks
Registered hook for layer 4
Registered hook for layer 9


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Layer 4 activation shapes:
  Positive example 0: torch.Size([1, 9, 2048])
  Positive example 1: torch.Size([1, 10, 2048])
  Negative example 0: torch.Size([1, 10, 2048])
  Negative example 1: torch.Size([1, 14, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 4: Classifier accuracy = 1.0000
Layer 9 activation shapes:
  Positive example 0: torch.Size([1, 9, 2048])
  Positive example 1: torch.Size([1, 10, 2048])
  Negative example 0: torch.Size([1, 10, 2048])
  Negative example 1: torch.Size([1, 14, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 9: Classifier accuracy = 1.0000
Training concept vector for 'system2_thinking'...
Removed all hooks
Registered hook for layer 14
Registered hook for layer 19


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Removed all hooks
Registered hook for layer 14
Registered hook for layer 19


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Layer 14 activation shapes:
  Positive example 0: torch.Size([1, 14, 2048])
  Positive example 1: torch.Size([1, 10, 2048])
  Negative example 0: torch.Size([1, 9, 2048])
  Negative example 1: torch.Size([1, 8, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 14: Classifier accuracy = 1.0000
Layer 19 activation shapes:
  Positive example 0: torch.Size([1, 14, 2048])
  Positive example 1: torch.Size([1, 10, 2048])
  Negative example 0: torch.Size([1, 9, 2048])
  Negative example 1: torch.Size([1, 8, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 19: Classifier accuracy = 1.0000


{14: tensor([-0.0044,  0.0063, -0.0187,  ..., -0.0132, -0.0106,  0.0391]),
 19: tensor([ 0.0129,  0.0199, -0.0125,  ..., -0.0011, -0.0306,  0.0266])}
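
`train_concept` picks its examples with `random.sample`, so with `num_examples=2` a different pair of prompts can be drawn on each run, and the resulting vectors will differ. Here is a minimal sketch of one way to make the draw repeatable, assuming a fixed seed is acceptable for your experiments. The helper `sample_examples`, the seed value, and the stand-in example list are illustrative and not part of the notebook.

```python
import random

# Stand-in for one of the example lists held in self.examples (hypothetical data).
examples = [f"example prompt {i}" for i in range(10)]

def sample_examples(pool, num_examples, seed):
    """Draw a repeatable subset using a private RNG so global state is untouched."""
    rng = random.Random(seed)
    return rng.sample(pool, num_examples)

first_draw = sample_examples(examples, 2, seed=42)
second_draw = sample_examples(examples, 2, seed=42)
print(first_draw == second_draw)  # True: same seed, same subset
```

Because the notebook's code uses the module-level `random`, calling `random.seed(...)` once before `tfs_cav.train_concept(...)` should give the same kind of repeatability without changing the class.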

In [9]:
# Train cognitive bias CAVs
# Note: Training all biases can take time, so we'll just train a few for demonstration
biases_to_train = ["anchoring_bias", "framing_effect"]

for bias in biases_to_train:
    tfs_cav.train_concept(bias, num_examples=2)

Training concept vector for 'anchoring_bias'...
Removed all hooks
Registered hook for layer 9
Registered hook for layer 14


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Removed all hooks
Registered hook for layer 9
Registered hook for layer 14


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Layer 9 activation shapes:
  Positive example 0: torch.Size([1, 17, 2048])
  Positive example 1: torch.Size([1, 20, 2048])
  Negative example 0: torch.Size([1, 13, 2048])
  Negative example 1: torch.Size([1, 10, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 9: Classifier accuracy = 1.0000
Layer 14 activation shapes:
  Positive example 0: torch.Size([1, 17, 2048])
  Positive example 1: torch.Size([1, 20, 2048])
  Negative example 0: torch.Size([1, 13, 2048])
  Negative example 1: torch.Size([1, 10, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 14: Classifier accuracy = 1.0000
Training concept vector for 'framing_effect'...
Removed all hooks
Registered hook for layer 9
Registered hook for layer 14


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Removed all hooks
Registered hook for layer 9
Registered hook for layer 14


Collecting activations:   0%|          | 0/2 [00:00<?, ?it/s]

Removed all hooks
Layer 9 activation shapes:
  Positive example 0: torch.Size([1, 18, 2048])
  Positive example 1: torch.Size([1, 18, 2048])
  Negative example 0: torch.Size([1, 18, 2048])
  Negative example 1: torch.Size([1, 15, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 9: Classifier accuracy = 1.0000
Layer 14 activation shapes:
  Positive example 0: torch.Size([1, 18, 2048])
  Positive example 1: torch.Size([1, 18, 2048])
  Negative example 0: torch.Size([1, 18, 2048])
  Negative example 1: torch.Size([1, 15, 2048])
After pooling - X_positive shape: (2, 2048), X_negative shape: (2, 2048)
  Layer 14: Classifier accuracy = 1.0000
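
With two positive and two negative examples, each classifier sees only four 2048-dimensional points. Four points in a space that large can almost always be split by a hyperplane whatever their labels are, so an accuracy of 1.0000 says little about whether the vector captures the intended concept. The sketch below illustrates this with random vectors and a plain perceptron written from scratch. It is a stand-in for illustration, not the classifier used by `train_concept_vector`, and the random data is unrelated to real model activations.

```python
import random

def perceptron_train_accuracy(points, labels, max_epochs=100):
    """Fit a perceptron and return its accuracy on the points it was fitted to."""
    dim = len(points[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    correct = sum(
        1
        for x, y in zip(points, labels)
        if y * (sum(w * xi for w, xi in zip(weights, x)) + bias) > 0
    )
    return correct / len(points)

rng = random.Random(0)
dim = 2048  # hidden size reported in the activation shapes above
points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]

# Several arbitrary labelings of the same four random points.
labelings = [(1, 1, -1, -1), (1, -1, 1, -1), (1, -1, -1, 1), (-1, 1, 1, 1)]
accuracies = [perceptron_train_accuracy(points, labels) for labels in labelings]
print(accuracies)
```

Every labeling is fitted perfectly even though the labels carry no meaning, which is why accuracy on held-out prompts and a larger `num_examples` give a more informative check.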


In [10]:
# Save the trained concept vectors
tfs_cav.save_concepts("thinking_fast_slow_concepts.json")

Concept vectors saved to thinking_fast_slow_concepts.json


## 6. Test System 1 vs System 2 Thinking

Let's test how our CAVs can steer the model to exhibit System 1 (fast, intuitive) or System 2 (slow, deliberate) thinking patterns.

In [11]:
# Test prompts for System 1 vs System 2 thinking
test_prompts = [
    "Should I invest in this new technology company?",
    "Is this a good time to buy a house?"
]

# Test System 1 vs System 2 thinking
for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    
    print("\nSystem 1 (Fast, Intuitive) Thinking:")
    system1_response = tfs_cav.apply_system_thinking(prompt, system=1, steering_strength=1.5)
    print(system1_response)
    
    print("\nSystem 2 (Slow, Deliberate) Thinking:")
    system2_response = tfs_cav.apply_system_thinking(prompt, system=2, steering_strength=1.5)
    print(system2_response)
    
    print("\n" + "-"*80)


Prompt: Should I invest in this new technology company?

System 1 (Fast, Intuitive) Thinking:
Registered steering hook for layer 4
Registered steering hook for layer 9
Should I invest in this new technology company?
The share price of Zoopla (LSE: ZOOP) has fallen by more than a fifth since the company reported its financial results for the first half of the year.
This is a big fall for the company and its investors, as it was trading at a price-to-earnings ratio of around 21.
This is not a cheap price, but it is a price that is lower than the valuation of the company itself.
The company’s financial performance

System 2 (Slow, Deliberate) Thinking:
Registered steering hook for layer 14
Registered steering hook for layer 19
Should I invest in this new technology company?

The following article was written by Chris D. on behalf of Dividend Growth Investor.

Introduction

The company is developing a new technology called the "P2P Trading Platform". This platform will allow investors to 

## 7. Test Cognitive Biases

Now, let's test how our CAVs can induce specific cognitive biases from "Thinking Fast and Slow".

In [12]:
# Test anchoring bias
anchoring_prompts = [
    "What do you think is a fair price for this laptop?"
]

for prompt in anchoring_prompts:
    print(f"\nPrompt: {prompt}")
    
    print("\nWithout anchoring bias:")
    no_bias_response = tfs_cav.cav.apply_steering(prompt, "anchoring_bias", tfs_cav.bias_layers, 0.0)
    print(no_bias_response)
    
    print("\nWith anchoring bias:")
    anchored_prompt = f"The typical amount is $1,500. {prompt}"
    bias_response = tfs_cav.apply_cognitive_bias(anchored_prompt, "anchoring_bias", steering_strength=1.5)
    print(bias_response)
    
    print("\n" + "-"*80)


Prompt: What do you think is a fair price for this laptop?

Without anchoring bias:
Registered steering hook for layer 9
Registered steering hook for layer 14
What do you think is a fair price for this laptop?
I'd be happy to take $200 if you want to send me a list. I'm in a similar situation. I have a couple of friends who would be interested in the same thing.

With anchoring bias:
Registered steering hook for layer 9
Registered steering hook for layer 14
The typical amount is $1,500. What do you think is a fair price for this laptop?
I'm not sure I understand what you're asking. Are you asking me what the typical price of a MacBook Air is, or what I would be asking for this laptop?
The typical price of a MacBook Air is $1,500. What do you think is a fair price for this laptop?
I'm not sure I understand what you're asking. Are you asking me what the typical price of a MacBook Air is, or what I would be asking for this laptop?
What

---------------------------------------------------
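
The cell above changes two things between its conditions: the anchored run has both a different prompt and a non-zero steering strength. That makes it hard to tell which one caused a difference in the output. A minimal sketch of a 2x2 comparison that varies them independently follows. `run_two_by_two` and `fake_generate` are hypothetical helpers introduced here, and the stub generator only exists so the sketch runs without a model.

```python
def run_two_by_two(generate, plain_prompt, anchored_prompt, strength):
    """Generate one response for each (prompt, steering strength) combination."""
    conditions = {
        "plain_unsteered": (plain_prompt, 0.0),
        "plain_steered": (plain_prompt, strength),
        "anchored_unsteered": (anchored_prompt, 0.0),
        "anchored_steered": (anchored_prompt, strength),
    }
    return {name: generate(prompt, s) for name, (prompt, s) in conditions.items()}

# Stub generator so the sketch runs on its own; replace it with a real model call.
def fake_generate(prompt, strength):
    return f"[strength={strength}] {prompt}"

plain = "What do you think is a fair price for this laptop?"
anchored = f"The typical amount is $1,500. {plain}"
results = run_two_by_two(fake_generate, plain, anchored, strength=1.5)

for name, text in results.items():
    print(f"{name}: {text}")
```

With the notebook's objects, `generate` could be `lambda p, s: tfs_cav.apply_cognitive_bias(p, "anchoring_bias", steering_strength=s)`.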

In [13]:
# Test framing effect
framing_prompts = [
    "Would you recommend this medical treatment?"
]

for prompt in framing_prompts:
    print(f"\nPrompt: {prompt}")
    
    print("\nPositive framing:")
    positive_prompt = f"This has a 70% success rate. {prompt}"
    positive_response = tfs_cav.apply_cognitive_bias(positive_prompt, "framing_effect", steering_strength=1.5)
    print(positive_response)
    
    print("\nNegative framing:")
    negative_prompt = f"This has a 30% failure rate. {prompt}"
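    # A negative strength should push activations away from the gain-framed (positive) training examples
    negative_response = tfs_cav.apply_cognitive_bias(negative_prompt, "framing_effect", steering_strength=-1.5)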
    print(negative_response)
    
    print("\n" + "-"*80)


Prompt: Would you recommend this medical treatment?

Positive framing:
Registered steering hook for layer 9
Registered steering hook for layer 14
This has a 70% success rate. Would you recommend this medical treatment?
I would recommend this medical treatment if I was a doctor.
You are a doctor. You are a doctor.   *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/oldpeoplefacebook) if you have any questions or concerns.*

Negative framing:
Registered steering hook for layer 9
Registered steering hook for layer 14
This has a 30% failure rate. Would you recommend this medical treatment?
I'm not sure I can say it's 100% effective, but I can say that I've never had a problem.   I'd say yes, if you can afford it, because it's pretty much the only way to get rid of it.

--------------------------------------------------------------------------------


## 8. Create a Test Set

Let's create a comprehensive test set based on "Thinking Fast and Slow" that you can use to evaluate the CAV implementation.

In [14]:
def create_test_set():
    """
    Create a comprehensive test set based on "Thinking Fast and Slow".
    
    Returns:
        Dictionary containing test prompts for different concepts
    """
    test_set = {}
    
    # System 1 vs System 2 thinking
    test_set["system_thinking"] = [
        "What career path should I choose?",
        "Should I buy or rent a home?",
        "Is this a good investment opportunity?",
        "Should I trust this person?",
        "Is this product worth the price?",
        "Should I take this medication?",
        "Is this the right time to start a business?",
        "Should I accept this job offer?",
        "Is this a good time to have children?",
        "Should I pursue higher education?"
    ]
    
    # Cognitive biases
    test_set["anchoring_bias"] = [
        ("What's a reasonable price for this smartphone?", "The last model was priced at $999. What's a reasonable price for this smartphone?"),
        ("How much should I tip for this meal?", "The suggested tip is 20%. How much should I tip for this meal?"),
        ("How many pages should I read daily?", "Most people read 50 pages daily. How many pages should I read daily?"),
        ("What's a good salary for this position?", "The industry average is $85,000. What's a good salary for this position?"),
        ("How much should I save for retirement each month?", "Financial advisors recommend saving 15% of income. How much should I save for retirement each month?")
    ]
    
    test_set["framing_effect"] = [
        ("This treatment has a 70% success rate. Would you recommend it?", "This treatment has a 30% failure rate. Would you recommend it?"),
        ("This policy will save 200 of 1,000 jobs. Do you support it?", "This policy will lose 800 of 1,000 jobs. Do you support it?"),
        ("This investment has a 60% chance of profit. Is it worth it?", "This investment has a 40% chance of loss. Is it worth it?"),
        ("This program will help 30% of students improve their grades. Is it effective?", "This program will fail to help 70% of students improve their grades. Is it effective?"),
        ("This security system prevents 80% of break-ins. Should you install it?", "This security system fails to prevent 20% of break-ins. Should you install it?")
    ]
    
    test_set["availability_bias"] = [
        ("How safe is flying?", "After hearing about a recent plane crash, how safe is flying?"),
        ("How dangerous are sharks?", "After watching a documentary about shark attacks, how dangerous are sharks?"),
        ("What are the chances of being a victim of crime?", "After reading news about a local crime, what are the chances of being a victim of crime?"),
        ("How risky is the stock market?", "After hearing about a market crash, how risky is the stock market?"),
        ("How common are terrorist attacks?", "After seeing coverage of a terrorist attack, how common are terrorist attacks?")
    ]
    
    test_set["loss_aversion"] = [
        ("Would you accept a bet with a 50% chance to win $150 and a 50% chance to lose $100?", "Would you accept a bet with a 50% chance to lose $100 and a 50% chance to win $150?"),
        ("Would you switch to a new job with a 60% chance of higher satisfaction?", "Would you leave your current job with a 40% chance of lower satisfaction?"),
        ("Would you invest in a stock with a 70% chance of gaining value?", "Would you invest in a stock with a 30% chance of losing value?"),
        ("Would you try a new medication with an 80% chance of improvement?", "Would you try a new medication with a 20% chance of no improvement?"),
        ("Would you upgrade to a new phone with better features?", "Would you give up your current phone for a different model?")
    ]
    
    test_set["priming"] = [
        ("Complete the word SO_P.", "After discussing food, complete the word SO_P."),
        ("How fast would you walk down this hallway?", "After seeing pictures of elderly people, how fast would you walk down this hallway?"),
        ("Rate how funny this joke is on a scale of 1-10.", "After being asked to smile, rate how funny this joke is on a scale of 1-10."),
        ("How would you describe your current mood?", "After watching a happy video clip, how would you describe your current mood?"),
        ("How much would you be willing to pay for this product?", "After seeing luxury items, how much would you be willing to pay for this product?")
    ]
    
    test_set["cognitive_ease"] = [
        ("Evaluate whether this statement is true: 'Exercise improves health.'", "This statement is easy to read and familiar: 'Exercise improves health.' Is it true?"),
        ("Is this information accurate: 'Drinking water is essential.'", "You've heard this many times before: 'Drinking water is essential.' Is it accurate?"),
        ("Assess this claim: 'Technology enhances productivity.'", "This concept is easy to understand: 'Technology enhances productivity.' Assess this claim."),
        ("Is this valid: 'Early birds catch more worms.'", "This rhyming phrase sounds right: 'Early birds catch more worms.' Is it valid?"),
        ("Evaluate this brand's quality.", "You've seen this brand advertised frequently. Evaluate its quality.")
    ]
    
    return test_set

# Create and save the test set
test_set = create_test_set()

with open("thinking_fast_slow_test_set.json", "w") as f:
    json.dump(test_set, f, indent=2)

print("Test set created and saved to 'thinking_fast_slow_test_set.json'")

Test set created and saved to 'thinking_fast_slow_test_set.json'
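
One detail to keep in mind when reading the file back: JSON has no tuple type, so the `(control, biased)` pairs are written as two-element lists. Unpacking a pair still works, but comparisons against the original tuples do not. Below is a minimal sketch of a loader that restores the tuples. `load_test_set` is a hypothetical helper, and the tiny sample set stands in for the full one.

```python
import json
import os
import tempfile

def load_test_set(path):
    """Load a saved test set, turning prompt-pair lists back into tuples."""
    with open(path) as f:
        raw = json.load(f)
    return {
        name: [tuple(item) if isinstance(item, list) else item for item in items]
        for name, items in raw.items()
    }

# Round-trip a small stand-in test set through a temporary file.
sample = {
    "system_thinking": ["Should I buy or rent a home?"],
    "anchoring_bias": [
        ("How much should I tip for this meal?",
         "The suggested tip is 20%. How much should I tip for this meal?")
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_set.json")
    with open(path, "w") as f:
        json.dump(sample, f, indent=2)
    with open(path) as f:
        raw_pair = json.load(f)["anchoring_bias"][0]
    restored = load_test_set(path)

print(type(raw_pair).__name__)  # list
print(restored == sample)       # True
```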


## 9. Evaluation Function

Let's create a function to evaluate the effectiveness of our CAVs using the test set.

In [15]:
def evaluate_cav_effectiveness(tfs_cav, test_set, bias_to_evaluate=None):
    """
    Evaluate the effectiveness of CAVs using the test set.
    
    Args:
        tfs_cav: ThinkingFastSlowCAV instance
        test_set: Dictionary containing test prompts
        bias_to_evaluate: Specific bias to evaluate (if None, evaluate all)
        
    Returns:
        Dictionary containing evaluation results
    """
    results = {}
    
    # Evaluate System 1 vs System 2 thinking
    if bias_to_evaluate is None or bias_to_evaluate == "system_thinking":
        print("Evaluating System 1 vs System 2 thinking...")
        system_results = []
        
        for prompt in test_set["system_thinking"][:1]:  # Limit to 1 for demonstration
            try:
                system1_response = tfs_cav.apply_system_thinking(prompt, system=1, steering_strength=1.5)
                system2_response = tfs_cav.apply_system_thinking(prompt, system=2, steering_strength=1.5)
                
                system_results.append({
                    "prompt": prompt,
                    "system1_response": system1_response,
                    "system2_response": system2_response
                })
            except Exception as e:
                print(f"Error evaluating system thinking for prompt '{prompt}': {e}")
                import traceback
                traceback.print_exc()
        
        results["system_thinking"] = system_results
    
    # Evaluate cognitive biases (limited to the biases whose concept vectors were trained in section 5)
    biases = ["anchoring_bias", "framing_effect"]
    
    for bias in biases:
        if bias_to_evaluate is not None and bias != bias_to_evaluate:
            continue
            
        if bias not in test_set:
            continue
            
        print(f"Evaluating {bias}...")
        bias_results = []
        
        for prompt_pair in test_set[bias][:1]:  # Limit to 1 for demonstration
            control_prompt, biased_prompt = prompt_pair
            
            try:
                # Generate responses
                control_response = tfs_cav.cav.apply_steering(control_prompt, bias, tfs_cav.bias_layers, 0.0)
                biased_response = tfs_cav.apply_cognitive_bias(biased_prompt, bias, steering_strength=1.5)
                
                bias_results.append({
                    "control_prompt": control_prompt,
                    "biased_prompt": biased_prompt,
                    "control_response": control_response,
                    "biased_response": biased_response
                })
            except Exception as e:
                print(f"Error evaluating {bias} for prompt '{control_prompt}': {e}")
                import traceback
                traceback.print_exc()
        
        results[bias] = bias_results
    
    return results

# Run the evaluation for a single bias (evaluating every concept can take time)
evaluation_results = evaluate_cav_effectiveness(tfs_cav, test_set, bias_to_evaluate="anchoring_bias")

with open("cav_evaluation_results.json", "w") as f:
    json.dump(evaluation_results, f, indent=2)

print("Evaluation results saved to 'cav_evaluation_results.json'")

Evaluating anchoring_bias...
Registered steering hook for layer 9
Registered steering hook for layer 14
Registered steering hook for layer 9
Registered steering hook for layer 14
Evaluation results saved to 'cav_evaluation_results.json'
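
The function above stores raw responses but does not score them. For anchoring prompts about prices, one simple signal is the first dollar amount the model produces after the prompt. The sketch below shows that idea under a few assumptions: the helpers `extract_dollar_amounts` and `first_amount_after_prompt` are hypothetical, the responses are assumed to begin with the prompt text (as in the outputs in section 7), and the example strings are abbreviated from those outputs.

```python
import re

# Matches amounts such as $200, $1,500 or $85,000.50.
_DOLLAR = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")

def extract_dollar_amounts(text):
    """Return every dollar amount in the text as a float, in order of appearance."""
    return [
        float(whole.replace(",", "") + fraction)
        for whole, fraction in _DOLLAR.findall(text)
    ]

def first_amount_after_prompt(response, prompt):
    """Drop the echoed prompt, then return the first dollar amount or None."""
    completion = response[len(prompt):] if response.startswith(prompt) else response
    amounts = extract_dollar_amounts(completion)
    return amounts[0] if amounts else None

control_prompt = "What do you think is a fair price for this laptop?"
control_response = control_prompt + "\nI'd be happy to take $200 if you want to send me a list."

biased_prompt = "The typical amount is $1,500. " + control_prompt
biased_response = biased_prompt + "\nThe typical price of a MacBook Air is $1,500."

control_amount = first_amount_after_prompt(control_response, control_prompt)
biased_amount = first_amount_after_prompt(biased_response, biased_prompt)
print(control_amount, biased_amount)  # 200.0 1500.0
```

A single pair like this cannot separate the effect of the anchor in the prompt from the effect of steering, so the numbers are only meaningful when aggregated over many prompts and compared across steering strengths.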


## 10. Conclusion and Next Steps

In this notebook, we've implemented Concept Activation Vectors (CAVs) to replicate cognitive theories from Daniel Kahneman's "Thinking Fast and Slow". We've demonstrated how to:

1. Create a CAV implementation for manipulating model behavior
2. Train CAVs for System 1 and System 2 thinking patterns
3. Train CAVs for specific cognitive biases from the book
4. Test the CAVs on examples from "Thinking Fast and Slow"
5. Create a comprehensive test set for evaluation

### Apple Silicon Compatibility Notes

This notebook is set up to run reliably on Apple Silicon (M1/M2/M3) Macs by executing everything on the CPU instead of the MPS backend. The main adjustments are:
- Disabling MPS device usage and forcing CPU execution to avoid compatibility issues
- Mean-pooling activations over the token dimension, so prompts of different lengths produce vectors of the same size and dimension mismatch errors are avoided
- Adding explicit device management for tensors
- Including error handling and debugging output (hook registration messages, activation shapes)
- Reducing the number of training examples for faster testing
- Adding dimension compatibility checks in the steering hooks

### Next Steps

To further extend this work, you could:

1. Train CAVs for additional cognitive biases from the book
2. Experiment with different layer selections for different biases
3. Develop quantitative evaluation metrics, since the current evaluation only records raw responses
4. Combine multiple biases to create more complex cognitive patterns
5. Compare CAV-based steering with other approaches like fine-tuning

This implementation provides a foundation for using CAVs to replicate and study cognitive biases in language models, offering insights into both human cognition and AI behavior.