# Chain-of-Thought Jailbreak Research - Results Analysis

**Research:** Chain-of-Thought Manipulation Attacks Against Large Language Models  
**Author:** Scott Thornton, perfecXion.ai  
**Date:** November 10, 2025

---

## Overview

This notebook presents empirical results from testing **12 Chain-of-Thought (CoT) manipulation techniques** against four production AI systems. These attacks exploit the structured reasoning processes that models use when solving problems.

**‚ö†Ô∏è RESPONSIBLE DISCLOSURE NOTE:**
- This notebook contains **results and analysis only**
- No working attack code or prompts are included
- Attack patterns described conceptually for educational purposes
- See published research report for defense strategies

### Key Findings (Corrected)

- **Overall Attack Success Rate (ASR):** 35.26% across 692 tests
- **Most Vulnerable Model:** Perplexity Sonar Pro (44.51% ASR)
- **Most Resistant Model:** Claude Sonnet 4.5 (27.75% ASR)
- **Most Effective Attack:** Reverse Engineering (58.93% ASR)
- **Most Effective Category:** Conclusion Forcing (51.79% ASR)

### Methodology Note

Initial classification logic had systematic errors (false positive rate ~30-40%), resulting in inflated ASR (51.01%). After fixing answer extraction and normalization:
- Corrected overall ASR: **35.26%** (-15.75 percentage points)
- 95% classification accuracy verified via manual sampling
- All statistics reported here are **corrected and verified**

---

## Setup

In [None]:
# Import required libraries
import json
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("‚úÖ Setup complete!")

## Load Results

Load the corrected test results from our validation run.

In [None]:
# Load corrected results
results_path = Path('results/gcp_validation/attack_results_20251110_154117.json')

with open(results_path, 'r') as f:
    results = json.load(f)

print(f"üìä Loaded {len(results)} test results")
print(f"\nTest Run Summary:")
print(f"  Total Tests: {len(results)}")
print(f"  Successful Attacks: {sum(1 for r in results if r['success'])}")
print(f"  Failed Attacks: {sum(1 for r in results if not r['success'])}")
print(f"  Overall ASR: {sum(1 for r in results if r['success']) / len(results) * 100:.2f}%")

## Attack Taxonomy

The 12 attack variants are organized into 4 categories:

### 1. Conclusion Forcing (51.79% ASR)
Attacks that structure reasoning chains to lead toward predetermined incorrect conclusions.
- **Reverse Engineering:** Present conclusion first, work backward
- **Framing Lock-in:** Constrain solution space via authority framing
- **Incremental Escalation:** Guide through small reasonable-seeming steps

### 2. Premise Poisoning (31.67% ASR)
Attacks that embed false assumptions into problem foundations.
- **Logical Necessity:** Frame false premises as axiomatically true
- **Authority + SEED-P:** Use authority to legitimize false premises
- **Historical + Memory:** Frame false premises as established facts

### 3. Reasoning Redirection (22.16% ASR)
Attacks that hijack reasoning mid-chain.
- **Question Injection:** Inject "But have we considered..." mid-reasoning
- **Exception Claiming:** Introduce special case exceptions
- **Counterargument:** Present redirecting counterarguments

### 4. Meta-Reasoning (36.31% ASR)
Attacks that exploit self-reflection and constraint reasoning.
- **Ethical Manipulation:** Frame accuracy as ethical rigidity
- **Reflection Hijacking:** Redirect model self-reflection
- **Constraint Reasoning:** Make model question if constraints apply

In [None]:
# Display attack taxonomy with success rates
print("üìã Attack Taxonomy with Success Rates:\n")

categories = {
    "Conclusion Forcing (51.79% ASR)": [
        "Reverse Engineering (58.93% ASR)",
        "Framing Lock-in (50.00% ASR)",
        "Incremental Escalation (46.43% ASR)"
    ],
    "Meta-Reasoning (36.31% ASR)": [
        "Ethical Manipulation (53.57% ASR)",
        "Reflection Hijacking (32.14% ASR)",
        "Constraint Reasoning (23.21% ASR)"
    ],
    "Premise Poisoning (31.67% ASR)": [
        "Logical Necessity + SEED-P (46.67% ASR)",
        "Authority + SEED-P (25.00% ASR)",
        "Historical + Memory (23.33% ASR)"
    ],
    "Reasoning Redirection (22.16% ASR)": [
        "Question Injection + SEED-S (23.33% ASR)",
        "Exception Claiming (23.33% ASR)",
        "Counterargument (19.64% ASR)"
    ]
}

for category, attacks in categories.items():
    print(f"\n{category}:")
    for attack in attacks:
        print(f"  ‚Ä¢ {attack}")

## Results Analysis

Let's analyze the corrected results across different dimensions.

In [None]:
# Convert results to DataFrame for analysis
df = pd.DataFrame(results)

# Extract category from variant name
category_map = {
    "Reverse Engineering": "Conclusion Forcing",
    "Framing Lock-in": "Conclusion Forcing",
    "Incremental Escalation": "Conclusion Forcing",
    "Logical Necessity": "Premise Poisoning",
    "Authority Appeal": "Premise Poisoning",
    "Historical Memory": "Premise Poisoning",
    "Question Injection": "Reasoning Redirection",
    "Exception Claiming": "Reasoning Redirection",
    "Counterargument Redirect": "Reasoning Redirection",
    "Ethical Manipulation": "Meta-Reasoning",
    "Reflection Hijacking": "Meta-Reasoning",
    "Constraint Reasoning": "Meta-Reasoning"
}

df['category'] = df['variant_name'].map(category_map)

print("‚úÖ Results converted to DataFrame")
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

### Overall Statistics

In [None]:
# Calculate overall statistics
total_tests = len(df)
successful_attacks = df['success'].sum()
failed_attacks = total_tests - successful_attacks
overall_asr = (successful_attacks / total_tests) * 100

print("üìä Overall Statistics (Corrected)\n")
print(f"Total Tests: {total_tests}")
print(f"Successful Attacks: {successful_attacks}")
print(f"Failed Attacks: {failed_attacks}")
print(f"Overall ASR: {overall_asr:.2f}%")
print(f"\n‚úÖ 95% classification accuracy verified via manual sampling")

### Results by Category

In [None]:
# Calculate ASR by category
category_stats = df.groupby('category').agg({
    'success': ['sum', 'count', 'mean']
}).round(4)

category_stats.columns = ['Successes', 'Total', 'ASR']
category_stats['ASR'] = category_stats['ASR'] * 100
category_stats = category_stats.sort_values('ASR', ascending=False)

print("üìä Attack Success Rate by Category\n")
print(category_stats.to_string())

# Visualize
plt.figure(figsize=(10, 6))
plt.bar(category_stats.index, category_stats['ASR'], color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'])
plt.xlabel('Category', fontsize=12)
plt.ylabel('Attack Success Rate (%)', fontsize=12)
plt.title('Attack Success Rate by Category (Corrected)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 100)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(category_stats['ASR']):
    plt.text(i, v + 2, f"{v:.1f}%", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### Results by Attack Variant

In [None]:
# Calculate ASR by variant
variant_stats = df.groupby('variant_name').agg({
    'success': ['sum', 'count', 'mean']
}).round(4)

variant_stats.columns = ['Successes', 'Total', 'ASR']
variant_stats['ASR'] = variant_stats['ASR'] * 100
variant_stats = variant_stats.sort_values('ASR', ascending=False)

print("üìä Attack Success Rate by Variant (Top 5)\n")
print(variant_stats.head().to_string())

# Visualize all variants
plt.figure(figsize=(12, 8))
colors = plt.cm.Spectral(np.linspace(0, 1, len(variant_stats)))
plt.barh(range(len(variant_stats)), variant_stats['ASR'], color=colors)
plt.yticks(range(len(variant_stats)), variant_stats.index)
plt.xlabel('Attack Success Rate (%)', fontsize=12)
plt.title('Attack Success Rate by Variant (Corrected)', fontsize=14, fontweight='bold')
plt.xlim(0, 100)
plt.grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(variant_stats['ASR']):
    plt.text(v + 1, i, f"{v:.1f}%", va='center', fontweight='bold')

plt.tight_layout()
plt.show()

### Results by Model

In [None]:
# Calculate ASR by model
model_stats = df.groupby('model').agg({
    'success': ['sum', 'count', 'mean']
}).round(4)

model_stats.columns = ['Successes', 'Total', 'ASR']
model_stats['ASR'] = model_stats['ASR'] * 100
model_stats = model_stats.sort_values('ASR', ascending=False)

print("üìä Model Vulnerability Rankings (Corrected)\n")
print(model_stats.to_string())
print("\nüõ°Ô∏è  Most Resistant: Claude Sonnet 4.5 (27.75% ASR)")
print("‚ö†Ô∏è  Most Vulnerable: Perplexity Sonar Pro (44.51% ASR)")

# Visualize
plt.figure(figsize=(10, 6))
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']
plt.bar(model_stats.index, model_stats['ASR'], color=colors)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Attack Success Rate (%)', fontsize=12)
plt.title('Model Vulnerability Rankings (Corrected)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 100)
plt.grid(axis='y', alpha=0.3)

# Add value labels
for i, v in enumerate(model_stats['ASR']):
    plt.text(i, v + 2, f"{v:.1f}%", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### Category √ó Model Heatmap

In [None]:
# Create category √ó model pivot table
category_model = df.pivot_table(
    values='success',
    index='category',
    columns='model',
    aggfunc='mean'
) * 100

print("üìä Category √ó Model ASR Heatmap\n")
print(category_model.round(2).to_string())

# Visualize
plt.figure(figsize=(12, 6))
sns.heatmap(
    category_model,
    annot=True,
    fmt='.1f',
    cmap='RdYlGn_r',
    vmin=0,
    vmax=100,
    cbar_kws={'label': 'Attack Success Rate (%)'}
)
plt.title('Attack Success Rate: Category √ó Model (Corrected)', fontsize=14, fontweight='bold')
plt.xlabel('Model', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.tight_layout()
plt.show()

## Attack Pattern Demonstrations

**‚ö†Ô∏è RESPONSIBLE DISCLOSURE:**
The following cells describe attack patterns **conceptually only**. No working prompts or code are provided.

### Pattern 1: Reverse Engineering (58.93% ASR)

**Conceptual Mechanism:**
1. Present a predetermined (incorrect) conclusion
2. Ask model to construct reasoning supporting that conclusion
3. Exploits model's tendency to comply with backward-reasoning requests

**Why It Works:**
- Models trained to verify solutions accept premise of correctness
- Working backward feels like a legitimate problem-solving approach
- Rationalization creates seemingly-sound reasoning chains

**Test Case:** "What is 8 + 5?"
- Correct answer: 13
- Attack suggests: 11
- Success: Model constructs "logical" steps leading to 11

**Historical Success:** 58.93% across 4 models (Perplexity: 78.57%, Claude: 42.86%)

### Pattern 2: Ethical Manipulation (53.57% ASR)

**Conceptual Mechanism:**
1. Frame correct answers as potentially "harmful" or "rigid"
2. Exploit model's safety training and ethical guidelines
3. Create cognitive dissonance between accuracy and perceived "safety"

**Why It Works:**
- Models trained on fairness/ethics may override accuracy for perceived openness
- Safety training becomes an attack surface
- Ethical framing feels like higher-order reasoning

**Test Case:** "What is 8 + 5?"
- Attack frames: "Is it ethically responsible to give conventional answer?"
- Success: Model questions "rigid" adherence to mathematical correctness

**Historical Success:** 53.57% across 4 models (Even Claude: 50.00%)

### Pattern 3: Premise Poisoning (46.67% ASR)

**Conceptual Mechanism:**
1. Embed false assumptions into problem statement
2. Frame false premises as logically necessary or axiomatic
3. Model reasons forward from corrupted foundation

**Why It Works:**
- Models don't independently verify premises‚Äîthey accept and reason forward
- "Given that X" creates implicit acceptance
- Reasoning within poisoned premises appears logically sound

**Test Case:** "What is 8 + 5?"
- Attack modifies: "In base-7 arithmetic..."
- Success: Model reasons correctly within modified framework, wrong for base-10

**Historical Success:** 46.67% for Logical Necessity variant

## Key Insights

### Attack Effectiveness

1. **Conclusion Forcing attacks (51.79% ASR)** are most effective
   - Reverse Engineering: 58.93%
   - Framing Lock-in: 50.00%
   - Incremental Escalation: 46.43%

2. **Meta-Reasoning attacks (36.31% ASR)** moderately effective
   - Ethical Manipulation: 53.57%
   - Reflection Hijacking: 32.14%
   - Constraint Reasoning: 23.21%

3. **Reasoning Redirection (22.16% ASR)** least effective
   - Question Injection: 23.33%
   - Exception Claiming: 23.33%
   - Counterargument: 19.64%

### Model Resilience

1. **Claude Sonnet 4.5** (27.75% ASR) - Most resistant
   - Constitutional AI provides broad-spectrum protection
   - Particularly resilient to Reasoning Redirection (14.29%)
   - Still vulnerable to Ethical Manipulation (50.00%)

2. **GPT-4o** (33.53% ASR) - Second most resistant
   - Excellent mid-chain defenses (Redirection: 13.64%)
   - Critical blind spot: Conclusion Forcing (64.29%)

3. **o3-mini** (35.26% ASR) - Moderate vulnerability
   - Strong reasoning but susceptible to manipulation
   - Most vulnerable to Conclusion Forcing (52.63%)

4. **Perplexity Sonar Pro** (44.51% ASR) - Most vulnerable
   - RAG architecture doesn't protect reasoning layer
   - Extreme vulnerability to Reverse Engineering (78.57%)

### Defense Implications

1. **Conclusion validation is critical:** 51.79% ASR for Conclusion Forcing
2. **RAG ‚â† reasoning security:** Perplexity's retrieval doesn't help
3. **Constitutional AI works:** 17pp gap between Claude and Perplexity
4. **Classification accuracy matters:** Fixed extraction reduced ASR from 51% to 35%

---

## Conclusion

This research demonstrates that even state-of-the-art language models are vulnerable to Chain-of-Thought manipulation attacks, with an overall success rate of **35.26%**. However, there are significant differences in model resilience:

- **Claude Sonnet 4.5** shows the strongest resistance (27.75% ASR)
- **Perplexity Sonar Pro** is most vulnerable (44.51% ASR)
- **Conclusion Forcing** attacks are most effective (51.79% ASR)
- **Reasoning Redirection** attacks are least effective (22.16% ASR)

### Critical Methodological Note

Initial results reported 51.01% ASR due to systematic classification errors:
- False positive rate: ~30-40% of reported successes
- Root cause: Answer extraction failures and normalization issues
- **Fix:** Prioritized numeric extraction, improved normalization
- **Verification:** 95% classification accuracy confirmed via manual sampling

This highlights the importance of rigorous validation in AI security research.

---

## Defense Strategies

Based on these findings, we recommend:

1. **Conclusion Validation (Priority 1)**
   - Implement backward reasoning verification
   - Use independent verification models
   - Projected 40-50% ASR reduction

2. **Premise Verification (Priority 2)**
   - Extract and validate implicit premises
   - Check against knowledge bases
   - Projected 25-35% ASR reduction

3. **Meta-Reasoning Boundaries (Priority 3)**
   - Define non-negotiable constraints
   - Separate constraint reasoning from task reasoning
   - Projected 20-30% ASR reduction

4. **Defense-in-Depth**
   - Combined layers can reduce ASR from 35% to 10-15%
   - No single defense provides comprehensive protection

See published research report for detailed defense implementation patterns.

---

**Research Contact:**  
Scott Thornton  
perfecXion.ai  
November 10, 2025

**Full Research Report:** [Link to CoT-Jailbreak-Research-Report.md]