# Meta-Prompting Deep Dive: Multi-Round Verification and Error Correction

## Learning Objective

Master the **Multi-Round Verification and Error Correction** mechanism in meta-prompting: how the system systematically validates solutions, detects errors, and iteratively refines results through multiple expert perspectives.

## Paper Context

From **Section 4.2** of Suzgun & Kalai (2024):

> *"Our structured approach embodies the principle of the wisdom of the crowd, which posits that a collective opinion of a diverse set of critical thinkers often surpasses the insights of individual experts."*

From **Section 5.1**:

> *"The Meta Model's systematic verification protocol strengthens the reliability and robustness of its solutions. Fundamental to this approach is the consistent practice of consulting an expert for validation before finalizing responses...By integrating this dual verification mechanism, the model significantly enhances solution accuracy and reliability."*

## Core Verification Principles

### 1. **Systematic Dual Verification**
- **Primary Expert**: Generates initial solution
- **Verification Expert**: Independent validation with fresh eyes
- **Meta Model**: Synthesizes and makes final determination

### 2. **Error Detection Mechanisms**
- **Cross-validation**: Multiple experts review same problem
- **Perspective diversity**: Different domain experts spot different errors
- **Iterative refinement**: Solutions improve through multiple rounds

### 3. **Wisdom of the Crowd**
- **Collective intelligence**: Aggregate multiple expert opinions
- **Error averaging**: Individual mistakes tend to be outvoted by group consensus, provided the experts do not all make the same mistake
- **Confidence calibration**: Higher confidence in validated solutions
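
The error-averaging idea can be made concrete with a short calculation. The sketch below assumes that each verifier is independently correct with the same probability `p`. This is an idealization: experts instantiated from the same underlying model share failure modes, so real gains are likely smaller. `majority_correct_probability` is an illustrative helper, not something taken from the paper.

```python
from math import comb

def majority_correct_probability(n: int, p: float) -> float:
    """P(strict majority of n independent verifiers is correct), each correct w.p. p."""
    if n < 1 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 1 and 0 <= p <= 1")
    needed = n // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(needed, n + 1))

# Assumed per-verifier accuracy of 0.7, chosen only for illustration.
for n in (1, 3, 5):
    print(n, round(majority_correct_probability(n, 0.7), 4))
```

Under the independence assumption, the majority of three verifiers at `p = 0.7` is correct with probability 0.784, and the figure keeps rising with more verifiers. When `p` is below 0.5 the effect reverses, which is one reason to weight votes by confidence rather than count them equally.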

## Environment Setup

In [None]:
# Install required packages
!pip install langchain langchain-openai python-dotenv matplotlib numpy pandas seaborn scipy

In [None]:
import os
import re
import json
import random
import math
from typing import List, Dict, Optional, Tuple, Any, Set
from dataclasses import dataclass, field
from collections import defaultdict, Counter
from enum import Enum
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from dotenv import load_dotenv

# LangChain imports
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage, BaseMessage
from langchain_openai import ChatOpenAI

# Load environment variables
load_dotenv()

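# Initialize LLM (both models need OPENAI_API_KEY in the environment or .env file)
try:
    llm = ChatOpenAI(model="gpt-4", temperature=0, max_tokens=1024)
    print("GPT-4 initialized successfully")
except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
    print(f"GPT-4 initialization failed ({exc}); falling back to GPT-3.5-turbo")
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=1024)
    print("Using GPT-3.5-turbo")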

print("Environment setup complete!")

## Verification Framework

Let's implement a comprehensive verification and error correction system:

In [None]:
class VerificationLevel(Enum):
    """Verification confidence levels"""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    UNCERTAIN = "uncertain"

@dataclass
class VerificationResult:
    """Result of verification process"""
    is_correct: bool
    confidence_level: VerificationLevel
    expert_name: str
    verification_reasoning: str
    identified_errors: List[str] = field(default_factory=list)
    suggested_corrections: List[str] = field(default_factory=list)
    verification_round: int = 1

@dataclass
class SolutionCandidate:
    """Solution candidate with verification history"""
    solution: str
    generator_expert: str
    generation_round: int
    verification_results: List[VerificationResult] = field(default_factory=list)
    consensus_score: float = 0.0
    final_confidence: VerificationLevel = VerificationLevel.UNCERTAIN

class MultiRoundVerificationSystem:
    """Comprehensive multi-round verification and error correction system"""
    
    def __init__(self, llm, max_verification_rounds: int = 4):
        self.llm = llm
        self.max_verification_rounds = max_verification_rounds
        self.verification_history = []
        self.error_patterns = defaultdict(int)  # Track common error types
        
        # Expert profiles for verification
        self.verification_experts = {
            "Expert Mathematician": {
                "specialties": ["calculations", "proofs", "logical reasoning"],
                "verification_focus": "mathematical accuracy and logical consistency"
            },
            "Expert Chess Analyst": {
                "specialties": ["move validation", "position analysis", "tactical verification"],
                "verification_focus": "chess move accuracy and strategic validity"
            },
            "Expert Problem Solver": {
                "specialties": ["logical analysis", "constraint checking", "solution validation"],
                "verification_focus": "logical consistency and requirement satisfaction"
            },
            "Expert Code Reviewer": {
                "specialties": ["code correctness", "algorithm validation", "edge case analysis"],
                "verification_focus": "code correctness and algorithmic soundness"
            },
            "Expert Fact Checker": {
                "specialties": ["accuracy verification", "source validation", "consistency checking"],
                "verification_focus": "factual accuracy and information consistency"
            }
        }
    
    def generate_verification_instruction(self, 
                                        expert_name: str,
                                        original_problem: str,
                                        candidate_solution: str,
                                        verification_focus: str = "") -> str:
        """Generate comprehensive verification instruction for expert"""
        
        expert_info = self.verification_experts.get(expert_name, {})
        verification_focus = verification_focus or expert_info.get("verification_focus", "general accuracy")
        
        instruction = f"""
You are {expert_name}, tasked with verifying the accuracy and quality of a proposed solution.

Your verification focus: {verification_focus}

ORIGINAL PROBLEM:
{original_problem}

PROPOSED SOLUTION TO VERIFY:
{candidate_solution}

VERIFICATION TASK:
1. Carefully analyze the proposed solution for accuracy and completeness
2. Check for logical consistency and adherence to requirements
3. Identify any errors, omissions, or areas for improvement
4. Provide specific, actionable feedback

RESPONSE FORMAT:
VERIFICATION: [CORRECT/INCORRECT/PARTIALLY_CORRECT]
CONFIDENCE: [HIGH/MEDIUM/LOW]
REASONING: [Your detailed analysis]
ERRORS_FOUND: [List specific errors, or "None" if no errors]
SUGGESTIONS: [Specific improvements or corrections, or "None" if solution is correct]

Be thorough, objective, and constructive in your verification.
        """.strip()
        
        return instruction
    
    def parse_verification_response(self, response: str, expert_name: str, round_num: int) -> VerificationResult:
        """Parse expert verification response into structured result"""
        
        # Extract verification components
        verification_match = re.search(r'VERIFICATION:\s*(\w+)', response, re.IGNORECASE)
        confidence_match = re.search(r'CONFIDENCE:\s*(\w+)', response, re.IGNORECASE)
        reasoning_match = re.search(r'REASONING:\s*([\s\S]*?)(?=ERRORS_FOUND:|$)', response, re.IGNORECASE)
        errors_match = re.search(r'ERRORS_FOUND:\s*([\s\S]*?)(?=SUGGESTIONS:|$)', response, re.IGNORECASE)
        suggestions_match = re.search(r'SUGGESTIONS:\s*([\s\S]*?)$', response, re.IGNORECASE)
        
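        # Parse verification status
        # Note: PARTIALLY_CORRECT is counted as a passing vote, which is lenient;
        # a missing or unrecognized status is treated as not correct.
        verification_text = verification_match.group(1).upper() if verification_match else "UNCERTAIN"
        is_correct = verification_text in ["CORRECT", "PARTIALLY_CORRECT"]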
        
        # Parse confidence level
        confidence_text = confidence_match.group(1).upper() if confidence_match else "UNCERTAIN"
        confidence_map = {
            "HIGH": VerificationLevel.HIGH,
            "MEDIUM": VerificationLevel.MEDIUM,
            "LOW": VerificationLevel.LOW
        }
        confidence_level = confidence_map.get(confidence_text, VerificationLevel.UNCERTAIN)
        
        # Parse errors and suggestions
        errors_text = errors_match.group(1).strip() if errors_match else ""
        suggestions_text = suggestions_match.group(1).strip() if suggestions_match else ""
        
        # Drop blank lines and "None" placeholders, including variants such as "- None."
        identified_errors = [e.strip() for e in errors_text.split('\n') if e.strip(' \t-*.').lower() not in ("", "none")]
        suggested_corrections = [s.strip() for s in suggestions_text.split('\n') if s.strip(' \t-*.').lower() not in ("", "none")]
        
        reasoning = reasoning_match.group(1).strip() if reasoning_match else response
        
        return VerificationResult(
            is_correct=is_correct,
            confidence_level=confidence_level,
            expert_name=expert_name,
            verification_reasoning=reasoning,
            identified_errors=identified_errors,
            suggested_corrections=suggested_corrections,
            verification_round=round_num
        )
    
    def conduct_verification_round(self, 
                                 original_problem: str,
                                 candidate_solution: SolutionCandidate,
                                 verifier_expert: str,
                                 round_num: int) -> VerificationResult:
        """Conduct single verification round with specific expert"""
        
        # Generate verification instruction
        instruction = self.generate_verification_instruction(
            verifier_expert, original_problem, candidate_solution.solution
        )
        
        # Get expert verification (Fresh Eyes - no history)
        response = self.llm.invoke([HumanMessage(content=instruction)])
        
        # Parse verification result
        verification_result = self.parse_verification_response(
            response.content, verifier_expert, round_num
        )
        
        # Track error patterns
        for error in verification_result.identified_errors:
            self.error_patterns[error] += 1
        
        return verification_result
    
    def calculate_consensus_score(self, verification_results: List[VerificationResult]) -> float:
        """Calculate consensus score from multiple verification results"""
        
        if not verification_results:
            return 0.0
        
        # Weight votes by confidence level
        confidence_weights = {
            VerificationLevel.HIGH: 1.0,
            VerificationLevel.MEDIUM: 0.7,
            VerificationLevel.LOW: 0.4,
            VerificationLevel.UNCERTAIN: 0.1
        }
        
        total_weight = 0.0
        correct_weight = 0.0
        
        for result in verification_results:
            weight = confidence_weights[result.confidence_level]
            total_weight += weight
            
            if result.is_correct:
                correct_weight += weight
        
        return correct_weight / total_weight if total_weight > 0 else 0.0
    
    def determine_final_confidence(self, 
                                 consensus_score: float,
                                 verification_results: List[VerificationResult]) -> VerificationLevel:
        """Determine final confidence level based on consensus and individual verifications"""
        
        num_verifications = len(verification_results)
        high_confidence_count = sum(1 for r in verification_results if r.confidence_level == VerificationLevel.HIGH)
        
        if consensus_score >= 0.8 and high_confidence_count >= 2:
            return VerificationLevel.HIGH
        elif consensus_score >= 0.6 and num_verifications >= 2:
            return VerificationLevel.MEDIUM
        elif consensus_score >= 0.4:
            return VerificationLevel.LOW
        else:
            return VerificationLevel.UNCERTAIN
    
    def run_multi_round_verification(self, 
                                   original_problem: str,
                                   candidate_solution: SolutionCandidate,
                                   verification_experts: Optional[List[str]] = None) -> SolutionCandidate:
        """Run complete multi-round verification process"""
        
        if verification_experts is None:
            # Fall back to a fixed pair of general-purpose verifiers
            verification_experts = ["Expert Problem Solver", "Expert Fact Checker"]
        
        # Ensure we don't exceed max rounds
        verification_experts = verification_experts[:self.max_verification_rounds]
        
        verification_results = []
        
        for round_num, expert in enumerate(verification_experts, 1):
            print(f"  Verification Round {round_num}: {expert}")
            
            verification_result = self.conduct_verification_round(
                original_problem, candidate_solution, expert, round_num
            )
            
            verification_results.append(verification_result)
            
            # Early stopping if high confidence consensus reached
            if (round_num >= 2 and 
                all(r.confidence_level == VerificationLevel.HIGH for r in verification_results[-2:]) and
                all(r.is_correct == verification_results[-1].is_correct for r in verification_results[-2:])):
                print(f"  Early consensus reached after {round_num} rounds")
                break
        
        # Calculate final scores
        consensus_score = self.calculate_consensus_score(verification_results)
        final_confidence = self.determine_final_confidence(consensus_score, verification_results)
        
        # Update candidate solution
        candidate_solution.verification_results = verification_results
        candidate_solution.consensus_score = consensus_score
        candidate_solution.final_confidence = final_confidence
        
        return candidate_solution

# Initialize verification system
verification_system = MultiRoundVerificationSystem(llm, max_verification_rounds=4)
print("Multi-Round Verification System initialized!")
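
To see how parsing and confidence-weighted voting interact without calling an LLM, the sketch below runs three hand-written replies through a simplified parser and the same weights used in `calculate_consensus_score`. The replies in `SAMPLE_REPLIES` are hypothetical, and `parse_vote` and `consensus` are cut-down stand-ins for the class methods above, not replacements for them.

```python
import re
from typing import List, Tuple

# Confidence weights copied from calculate_consensus_score above.
WEIGHTS = {"HIGH": 1.0, "MEDIUM": 0.7, "LOW": 0.4, "UNCERTAIN": 0.1}

# Hypothetical expert replies written in the requested RESPONSE FORMAT.
SAMPLE_REPLIES = [
    "VERIFICATION: CORRECT\nCONFIDENCE: HIGH\nREASONING: Steps check out.\n"
    "ERRORS_FOUND: None\nSUGGESTIONS: None",
    "VERIFICATION: INCORRECT\nCONFIDENCE: MEDIUM\nREASONING: Unit conversion is off.\n"
    "ERRORS_FOUND: Wrong km/h value\nSUGGESTIONS: Recompute 60 * 1.60934",
    "VERIFICATION: CORRECT\nCONFIDENCE: LOW\nREASONING: Looks plausible.\n"
    "ERRORS_FOUND: None\nSUGGESTIONS: None",
]

def parse_vote(reply: str) -> Tuple[bool, str]:
    """Extract (is_correct, confidence); missing or unknown fields become UNCERTAIN."""
    verdict = re.search(r"VERIFICATION:\s*(\w+)", reply, re.IGNORECASE)
    confidence = re.search(r"CONFIDENCE:\s*(\w+)", reply, re.IGNORECASE)
    verdict_text = verdict.group(1).upper() if verdict else "UNCERTAIN"
    confidence_text = confidence.group(1).upper() if confidence else "UNCERTAIN"
    if confidence_text not in WEIGHTS:
        confidence_text = "UNCERTAIN"
    return verdict_text in ("CORRECT", "PARTIALLY_CORRECT"), confidence_text

def consensus(votes: List[Tuple[bool, str]]) -> float:
    """Share of total confidence weight that voted 'correct'."""
    total = sum(WEIGHTS[conf] for _, conf in votes)
    passing = sum(WEIGHTS[conf] for ok, conf in votes if ok)
    return passing / total if total > 0 else 0.0

votes = [parse_vote(reply) for reply in SAMPLE_REPLIES]
score = consensus(votes)
print(votes)
print(f"consensus = {score:.3f}")  # (1.0 + 0.4) / (1.0 + 0.7 + 0.4)
```

Two of three experts approve, but the weighted score is only about 0.667 because the dissenting vote carries MEDIUM confidence while one approval carries LOW. With the thresholds in `determine_final_confidence`, that score and a single HIGH vote map to a MEDIUM final confidence, and the score falls below a 0.8 quality threshold.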

## Error Correction and Refinement Engine

This system handles iterative solution refinement based on verification feedback:

In [None]:
@dataclass
class RefinementIteration:
    """Single refinement iteration"""
    iteration_number: int
    original_solution: str
    identified_issues: List[str]
    correction_instructions: str
    refined_solution: str
    refinement_expert: str

class ErrorCorrectionEngine:
    """Engine for iterative error correction and solution refinement"""
    
    def __init__(self, llm, verification_system: MultiRoundVerificationSystem):
        self.llm = llm
        self.verification_system = verification_system
        self.max_refinement_iterations = 3
        self.refinement_history = []
    
    def analyze_verification_feedback(self, verification_results: List[VerificationResult]) -> Dict[str, Any]:
        """Analyze verification feedback to identify correction priorities"""
        
        all_errors = []
        all_suggestions = []
        error_frequency = Counter()
        
        for result in verification_results:
            all_errors.extend(result.identified_errors)
            all_suggestions.extend(result.suggested_corrections)
            
            # Weight errors by confidence level
            weight = {
                VerificationLevel.HIGH: 3,
                VerificationLevel.MEDIUM: 2, 
                VerificationLevel.LOW: 1,
                VerificationLevel.UNCERTAIN: 0
            }[result.confidence_level]
            
            for error in result.identified_errors:
                error_frequency[error] += weight
        
        # Categorize errors by type
        error_categories = {
            "logical": [],
            "computational": [],
            "factual": [],
            "procedural": [],
            "other": []
        }
        
        for error in all_errors:
            error_lower = error.lower()
            if any(word in error_lower for word in ["logic", "reasoning", "contradiction", "inconsistent"]):
                error_categories["logical"].append(error)
            elif any(word in error_lower for word in ["calculation", "math", "compute", "number"]):
                error_categories["computational"].append(error)
            elif any(word in error_lower for word in ["fact", "incorrect", "wrong", "inaccurate"]):
                error_categories["factual"].append(error)
            elif any(word in error_lower for word in ["step", "procedure", "method", "approach"]):
                error_categories["procedural"].append(error)
            else:
                error_categories["other"].append(error)
        
        return {
            "all_errors": all_errors,
            "all_suggestions": all_suggestions,
            "error_frequency": error_frequency,
            "error_categories": error_categories,
            "priority_errors": error_frequency.most_common(5)
        }
    
    def generate_correction_instruction(self, 
                                      original_problem: str,
                                      current_solution: str,
                                      feedback_analysis: Dict[str, Any],
                                      refinement_expert: str) -> str:
        """Generate instruction for solution refinement"""
        
        priority_errors = [error for error, _ in feedback_analysis["priority_errors"]]
        key_suggestions = feedback_analysis["all_suggestions"][:5]  # Top 5 suggestions
        
        instruction = f"""
You are {refinement_expert}, tasked with refining a solution based on expert feedback.

ORIGINAL PROBLEM:
{original_problem}

CURRENT SOLUTION (needs refinement):
{current_solution}

IDENTIFIED ISSUES TO ADDRESS:
"""
        
        if priority_errors:
            instruction += "\n".join([f"- {error}" for error in priority_errors])
        else:
            instruction += "- No specific errors identified, but solution needs improvement"
        
        instruction += "\n\nSUGGESTED IMPROVEMENTS:\n"
        
        if key_suggestions:
            instruction += "\n".join([f"- {suggestion}" for suggestion in key_suggestions])
        else:
            instruction += "- General improvement and optimization needed"
        
        instruction += """

REFINEMENT TASK:
1. Address each identified issue systematically
2. Incorporate suggested improvements where applicable
3. Ensure the refined solution is accurate, complete, and clear
4. Maintain all positive aspects of the original solution
5. Verify your refinements are correct

Provide the COMPLETE refined solution, not just the changes.
        """
        
        return instruction.strip()
    
    def refine_solution(self, 
                       original_problem: str,
                       candidate_solution: SolutionCandidate,
                       refinement_expert: str = "Expert Problem Solver") -> SolutionCandidate:
        """Refine solution based on verification feedback"""
        
        # Analyze verification feedback
        feedback_analysis = self.analyze_verification_feedback(candidate_solution.verification_results)
        
        # Generate correction instruction
        correction_instruction = self.generate_correction_instruction(
            original_problem, candidate_solution.solution, feedback_analysis, refinement_expert
        )
        
        # Get refined solution (Fresh Eyes - no conversation history)
        response = self.llm.invoke([HumanMessage(content=correction_instruction)])
        refined_solution = response.content
        
        # Record refinement iteration
        refinement_iteration = RefinementIteration(
            iteration_number=len(self.refinement_history) + 1,
            original_solution=candidate_solution.solution,
            identified_issues=feedback_analysis["all_errors"],
            correction_instructions=correction_instruction,
            refined_solution=refined_solution,
            refinement_expert=refinement_expert
        )
        
        self.refinement_history.append(refinement_iteration)
        
        # Create new solution candidate
        refined_candidate = SolutionCandidate(
            solution=refined_solution,
            generator_expert=refinement_expert,
            generation_round=candidate_solution.generation_round + 1
        )
        
        return refined_candidate
    
    def run_iterative_refinement(self, 
                                original_problem: str,
                                initial_solution: SolutionCandidate,
                                quality_threshold: float = 0.8) -> Tuple[SolutionCandidate, List[RefinementIteration]]:
        """Run complete iterative refinement process"""
        
        current_solution = initial_solution
        refinement_iterations = []
        
        for iteration in range(self.max_refinement_iterations):
            print(f"\n=== REFINEMENT ITERATION {iteration + 1} ===")
            
            # Run verification on current solution
            print("Running verification...")
            verified_solution = self.verification_system.run_multi_round_verification(
                original_problem, current_solution
            )
            
            print(f"Consensus Score: {verified_solution.consensus_score:.2f}")
            print(f"Confidence Level: {verified_solution.final_confidence.value}")
            
            # Check if quality threshold met
            if (verified_solution.consensus_score >= quality_threshold and 
                verified_solution.final_confidence in [VerificationLevel.HIGH, VerificationLevel.MEDIUM]):
                print(f"Quality threshold met! Stopping refinement.")
                return verified_solution, refinement_iterations
            
            # Check if no errors found (perfect solution)
            total_errors = sum(len(vr.identified_errors) for vr in verified_solution.verification_results)
            if total_errors == 0 and verified_solution.consensus_score > 0.5:
                print(f"No errors found! Solution verified.")
                return verified_solution, refinement_iterations
            
            # Refine solution based on feedback
            print("Refining solution based on feedback...")
            current_solution = self.refine_solution(original_problem, verified_solution)
            refinement_iterations.append(self.refinement_history[-1])
            
            print(f"Solution refined by {current_solution.generator_expert}")
        
        # Final verification
        print("\n=== FINAL VERIFICATION ===")
        final_solution = self.verification_system.run_multi_round_verification(
            original_problem, current_solution
        )
        
        return final_solution, refinement_iterations

# Initialize error correction engine
error_correction = ErrorCorrectionEngine(llm, verification_system)
print("Error Correction Engine initialized!")
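
The control flow of `run_iterative_refinement` can be exercised without API calls. The sketch below replaces the LLM with two hypothetical stubs, `fake_verify` and `fake_refine`, that operate on a toy "solution" (a number that should equal 5). Only the loop shape mirrors the engine above; the scoring rule is an assumption made for the demo.

```python
from typing import Callable, List, Tuple

def refine_until_verified(
    solution: float,
    verify: Callable[[float], Tuple[float, List[str]]],
    refine: Callable[[float, List[str]], float],
    quality_threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[float, float, int]:
    """Verify, stop if the score clears the threshold, otherwise refine and repeat.

    Returns (final_solution, final_score, refinements_performed).
    """
    refinements = 0
    for _ in range(max_iterations):
        score, errors = verify(solution)
        if score >= quality_threshold:
            return solution, score, refinements
        solution = refine(solution, errors)
        refinements += 1
    # Budget exhausted: verify once more so the score matches the returned solution.
    score, _ = verify(solution)
    return solution, score, refinements

TARGET = 5.0  # toy ground truth, used only by the stubs below

def fake_verify(solution: float) -> Tuple[float, List[str]]:
    """Stub verifier: score falls linearly with distance from TARGET."""
    gap = abs(solution - TARGET)
    score = max(0.0, 1.0 - gap / TARGET)
    errors = [] if gap == 0 else [f"off by {gap:g}"]
    return score, errors

def fake_refine(solution: float, errors: List[str]) -> float:
    """Stub refiner: closes half of the remaining gap."""
    return solution + (TARGET - solution) / 2

final, score, steps = refine_until_verified(2.0, fake_verify, fake_refine)
print(final, round(score, 3), steps)
```

Starting from 2.0, the stub scores are 0.4, 0.7, and 0.85, so the loop stops on the third verification after two refinements. The final verification after the loop matters: without it, a solution refined on the last iteration would be returned with a score that belongs to its predecessor.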

## Complete Verification and Correction System

This integrates all components into a complete system:

In [None]:
class ComprehensiveVerificationSystem:
    """Complete verification and error correction system"""
    
    def __init__(self, llm):
        self.llm = llm
        self.verification_system = MultiRoundVerificationSystem(llm)
        self.error_correction = ErrorCorrectionEngine(llm, self.verification_system)
        self.session_history = []
    
    def generate_initial_solution(self, problem: str, generator_expert: str = "Expert Problem Solver") -> SolutionCandidate:
        """Generate initial solution for the problem"""
        
        instruction = f"""
You are {generator_expert}, tasked with solving the following problem.

Problem: {problem}

Provide a comprehensive, accurate solution that addresses all aspects of the problem.
Show your work and reasoning clearly.
        """
        
        response = self.llm.invoke([HumanMessage(content=instruction)])
        
        return SolutionCandidate(
            solution=response.content,
            generator_expert=generator_expert,
            generation_round=1
        )
    
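    def run_complete_verification_cycle(self, 
                                      problem: str,
                                      generator_expert: str = "Expert Problem Solver",
                                      verification_experts: Optional[List[str]] = None,
                                      quality_threshold: float = 0.8) -> Dict[str, Any]:
        """Run complete verification and correction cycle.

        Note: verification_experts is accepted but not yet forwarded to the
        refinement loop, so verification always uses the default experts.
        """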
        
        print(f"=== COMPREHENSIVE VERIFICATION CYCLE ===")
        print(f"Problem: {problem[:100]}...")
        
        # Step 1: Generate initial solution
        print(f"\n1. GENERATING INITIAL SOLUTION")
        print(f"   Generator: {generator_expert}")
        initial_solution = self.generate_initial_solution(problem, generator_expert)
        print(f"   Solution generated: {len(initial_solution.solution)} characters")
        
        # Step 2: Run iterative verification and refinement
        print(f"\n2. ITERATIVE VERIFICATION AND REFINEMENT")
        final_solution, refinement_iterations = self.error_correction.run_iterative_refinement(
            problem, initial_solution, quality_threshold
        )
        
        # Step 3: Compile results
        print(f"\n3. COMPILATION AND ANALYSIS")
        
        session_result = {
            'problem': problem,
            'initial_solution': initial_solution,
            'final_solution': final_solution,
            'refinement_iterations': refinement_iterations,
            'total_verification_rounds': len(final_solution.verification_results),  # rounds run on the final candidate only
            'total_refinement_iterations': len(refinement_iterations),
            'final_consensus_score': final_solution.consensus_score,
            'final_confidence': final_solution.final_confidence,
            'quality_threshold_met': (final_solution.consensus_score >= quality_threshold and 
                                    final_solution.final_confidence in [VerificationLevel.HIGH, VerificationLevel.MEDIUM]),
            'experts_involved': self._get_all_involved_experts(initial_solution, final_solution, refinement_iterations)
        }
        
        self.session_history.append(session_result)
        
        # Print summary
        print(f"\n=== VERIFICATION CYCLE COMPLETE ===")
        print(f"Final Consensus Score: {final_solution.consensus_score:.2f}")
        print(f"Final Confidence: {final_solution.final_confidence.value}")
        print(f"Quality Threshold Met: {session_result['quality_threshold_met']}")
        print(f"Total Refinement Iterations: {len(refinement_iterations)}")
        print(f"Experts Involved: {len(session_result['experts_involved'])}")
        
        return session_result
    
    def _get_all_involved_experts(self, 
                                initial_solution: SolutionCandidate,
                                final_solution: SolutionCandidate,
                                refinement_iterations: List[RefinementIteration]) -> List[str]:
        """Get list of all experts involved in the process"""
        
        experts = set()
        
        # Generator expert
        experts.add(initial_solution.generator_expert)
        
        # Verification experts
        for vr in final_solution.verification_results:
            experts.add(vr.expert_name)
        
        # Refinement experts
        for ri in refinement_iterations:
            experts.add(ri.refinement_expert)
        
        return list(experts)

# Initialize comprehensive system
comprehensive_system = ComprehensiveVerificationSystem(llm)
print("Comprehensive Verification System ready!")

## Demonstration: Multi-Round Verification in Action

Let's test the system on problems where subtle errors are easy to make, so that several verification rounds may be needed:

In [None]:
# Test Case 1: Mathematical problem with common error trap
math_problem = """
A train travels 120 miles in 2 hours. If it maintains the same speed, 
how long will it take to travel 300 miles? Also, what is the train's speed in kilometers per hour?
(Note: 1 mile = 1.60934 kilometers)
"""

print("=== TEST CASE 1: MATHEMATICAL PROBLEM ===")
math_result = comprehensive_system.run_complete_verification_cycle(
    problem=math_problem.strip(),
    generator_expert="Expert Mathematician",
    quality_threshold=0.85
)

print(f"\n=== RESULTS SUMMARY ===")
print(f"Initial Solution Preview: {math_result['initial_solution'].solution[:200]}...")
print(f"Final Solution Preview: {math_result['final_solution'].solution[:200]}...")
print(f"Final Consensus Above 0.5: {math_result['final_consensus_score'] > 0.5}")
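
Because the LLM output varies between runs, it helps to have the expected answer on hand when reading the verification logs. The arithmetic for Test Case 1 is small enough to check directly; the variable names below are illustrative.

```python
MILES_TO_KM = 1.60934  # conversion factor given in the problem statement

distance_miles, time_hours = 120, 2
speed_mph = distance_miles / time_hours   # constant speed of the train
time_for_300 = 300 / speed_mph            # hours needed for 300 miles
speed_kmh = speed_mph * MILES_TO_KM       # same speed in km/h

print(f"{speed_mph} mph, {time_for_300} h, {speed_kmh:.4f} km/h")
```

The reference values are 60 mph, 5 hours, and 96.5604 km/h. A typical trap is converting the 300-mile distance instead of the speed, which a verifier focused on units should flag.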

## Test Case 2: Complex Programming Problem

In [None]:
# Test Case 2: Programming problem requiring multiple verification rounds
programming_problem = """
Write a Python function that finds the longest common subsequence (LCS) between two strings.
The function should return both the length of the LCS and the actual LCS string.
Include proper error handling for edge cases and optimize for both time and space complexity.

Example:
lcs("ABCDGH", "AEDFHR") should return (3, "ADH")
"""

print("\n\n=== TEST CASE 2: PROGRAMMING PROBLEM ===")
programming_result = comprehensive_system.run_complete_verification_cycle(
    problem=programming_problem.strip(),
    generator_expert="Expert Python",
    quality_threshold=0.80
)

print(f"\n=== RESULTS SUMMARY ===")
print(f"Programming solution verification complete")
print(f"Quality threshold met: {programming_result['quality_threshold_met']}")
print(f"Experts involved: {programming_result['experts_involved']}")
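
For comparison with whatever the experts produce, here is a minimal reference implementation of LCS using the standard dynamic-programming table. It is a sketch for checking outputs, not the solution the system is expected to return: it uses O(m·n) time and space and does not attempt the space optimization the problem statement asks for.

```python
from typing import Tuple

def lcs(a: str, b: str) -> Tuple[int, str]:
    """Return (length, one longest common subsequence) of a and b."""
    if not isinstance(a, str) or not isinstance(b, str):
        raise TypeError("lcs expects two strings")
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS string.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return table[m][n], "".join(reversed(chars))

print(lcs("ABCDGH", "AEDFHR"))
```

When several subsequences share the maximum length, the string returned depends on the tie-breaking rule in the walk-back, so a verifier should compare lengths first and only then check that the returned string is a valid common subsequence.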

## Analysis and Visualization

Let's analyze the verification patterns and effectiveness:

In [None]:
def analyze_verification_effectiveness(session_results: List[Dict[str, Any]]):
    """Analyze effectiveness of verification system across multiple sessions"""
    
    if not session_results:
        print("No session results to analyze")
        return
    
    # Collect metrics
    metrics = {
        'initial_scores': [],
        'final_scores': [],
        'improvement_deltas': [],
        'refinement_iterations': [],
        'verification_rounds': [],
        'experts_involved': [],
        'quality_threshold_met': [],
        'confidence_levels': []
    }
    
    for result in session_results:
        # The first candidate's score is not recorded, so 0.3 is an assumed baseline;
        # the "improvement" figures below are illustrative, not measured.
        initial_score = 0.3
        final_score = result['final_consensus_score']
        
        metrics['initial_scores'].append(initial_score)
        metrics['final_scores'].append(final_score)
        metrics['improvement_deltas'].append(final_score - initial_score)
        metrics['refinement_iterations'].append(result['total_refinement_iterations'])
        metrics['verification_rounds'].append(result['total_verification_rounds'])
        metrics['experts_involved'].append(len(result['experts_involved']))
        metrics['quality_threshold_met'].append(result['quality_threshold_met'])
        
        # Convert confidence to numeric
        confidence_numeric = {
            VerificationLevel.HIGH: 4,
            VerificationLevel.MEDIUM: 3,
            VerificationLevel.LOW: 2,
            VerificationLevel.UNCERTAIN: 1
        }[result['final_confidence']]
        metrics['confidence_levels'].append(confidence_numeric)
    
    # Create visualizations
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Score improvement
    x_pos = range(len(session_results))
    ax1.bar([x - 0.2 for x in x_pos], metrics['initial_scores'], 0.4, label='Initial Score', alpha=0.7, color='lightcoral')
    ax1.bar([x + 0.2 for x in x_pos], metrics['final_scores'], 0.4, label='Final Score', alpha=0.7, color='lightgreen')
    ax1.set_xlabel('Test Case')
    ax1.set_ylabel('Consensus Score')
    ax1.set_title('Score Improvement Through Verification')
    ax1.legend()
    ax1.set_ylim(0, 1.0)
    
    # 2. Verification rounds vs final score
    ax2.scatter(metrics['verification_rounds'], metrics['final_scores'], 
               c=metrics['refinement_iterations'], cmap='viridis', s=100, alpha=0.7)
    ax2.set_xlabel('Verification Rounds')
    ax2.set_ylabel('Final Consensus Score')
    ax2.set_title('Verification Rounds vs Quality')
    cbar = plt.colorbar(ax2.collections[0], ax=ax2)
    cbar.set_label('Refinement Iterations')
    
    # 3. Expert involvement
    ax3.bar(x_pos, metrics['experts_involved'], color='orange', alpha=0.7)
    ax3.set_xlabel('Test Case')
    ax3.set_ylabel('Number of Experts')
    ax3.set_title('Expert Involvement per Case')
    
    # 4. Success rate and confidence
    success_rate = sum(metrics['quality_threshold_met']) / len(metrics['quality_threshold_met']) * 100
    avg_confidence = np.mean(metrics['confidence_levels'])
    
    # Rescale each summary to a common 0-100 axis so the three bars are comparable
    categories = ['Success Rate (%)', 'Avg Confidence (% of max)', 'Avg Improvement (x100)']
    values = [success_rate, avg_confidence * 25, np.mean(metrics['improvement_deltas']) * 100]
    
    ax4.bar(categories, values, color=['green', 'blue', 'purple'], alpha=0.7)
    ax4.set_ylabel('Value (0-100 scale)')
    ax4.set_title('Overall System Performance')
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed analysis
    print("\n=== VERIFICATION SYSTEM ANALYSIS ===")
    print(f"Total Test Cases: {len(session_results)}")
    print(f"Success Rate: {success_rate:.1f}%")
    print(f"Average Score Improvement: {np.mean(metrics['improvement_deltas']):.2f}")
    print(f"Average Verification Rounds: {np.mean(metrics['verification_rounds']):.1f}")
    print(f"Average Refinement Iterations: {np.mean(metrics['refinement_iterations']):.1f}")
    print(f"Average Experts Involved: {np.mean(metrics['experts_involved']):.1f}")
    
    # Correlation analysis
    print("\n=== CORRELATION ANALYSIS ===")
    
    def safe_pearson(x, y):
        """Pearson's r, or NaN when it is undefined (fewer than 2 sessions or a constant input)"""
        if len(x) < 2 or np.std(x) == 0 or np.std(y) == 0:
            return float('nan')
        return stats.pearsonr(x, y)[0]
    
    correlations = {
        'Verification Rounds vs Final Score': safe_pearson(metrics['verification_rounds'], metrics['final_scores']),
        'Expert Count vs Final Score': safe_pearson(metrics['experts_involved'], metrics['final_scores']),
        'Refinement Iterations vs Improvement': safe_pearson(metrics['refinement_iterations'], metrics['improvement_deltas'])
    }
    
    for correlation_name, correlation_value in correlations.items():
        print(f"{correlation_name}: {correlation_value:.3f}")
    
    return metrics

# Analyze results from our test cases
test_results = comprehensive_system.session_history
if test_results:
    analysis_metrics = analyze_verification_effectiveness(test_results)
else:
    print("No test results available for analysis")

## Error Pattern Analysis

Let's analyze common error patterns detected by the verification system:

In [None]:
def analyze_error_patterns(verification_system: MultiRoundVerificationSystem):
    """Analyze common error patterns detected across all verifications"""
    
    error_patterns = verification_system.error_patterns
    
    if not error_patterns:
        print("No error patterns detected yet")
        return
    
    # Get top error patterns
    top_errors = error_patterns.most_common(10)
    
    # Categorize errors
    error_categories = defaultdict(list)
    
    for error, count in top_errors:
        error_lower = error.lower()
        
        if any(word in error_lower for word in ['calculation', 'math', 'arithmetic', 'compute']):
            error_categories['Computational'].append((error, count))
        elif any(word in error_lower for word in ['logic', 'reasoning', 'contradiction']):
            error_categories['Logical'].append((error, count))
        elif any(word in error_lower for word in ['fact', 'incorrect', 'wrong']):
            error_categories['Factual'].append((error, count))
        elif any(word in error_lower for word in ['format', 'structure', 'syntax']):
            error_categories['Structural'].append((error, count))
        else:
            error_categories['Other'].append((error, count))
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # 1. Top errors
    if top_errors:
        errors, counts = zip(*top_errors[:8])  # Top 8
        y_pos = range(len(errors))
        
        ax1.barh(y_pos, counts, color='lightcoral', alpha=0.7)
        ax1.set_yticks(y_pos)
        ax1.set_yticklabels([error[:40] + '...' if len(error) > 40 else error for error in errors])
        ax1.set_xlabel('Frequency')
        ax1.set_title('Most Common Errors Detected')
        ax1.invert_yaxis()
    
    # 2. Error categories
    category_totals = {cat: sum(count for _, count in errors) for cat, errors in error_categories.items()}
    
    if category_totals:
        categories = list(category_totals.keys())
        totals = list(category_totals.values())
        
        colors = plt.cm.Set3(range(len(categories)))
        ax2.pie(totals, labels=categories, autopct='%1.1f%%', colors=colors, startangle=90)
        ax2.set_title('Error Distribution by Category')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed analysis
    print("\n=== ERROR PATTERN ANALYSIS ===")
    print(f"Total Unique Errors Detected: {len(error_patterns)}")
    print(f"Total Error Instances: {sum(error_patterns.values())}")
    
    print("\n=== TOP ERROR PATTERNS ===")
    for i, (error, count) in enumerate(top_errors[:5], 1):
        print(f"{i}. {error} (detected {count} times)")
    
    print("\n=== ERROR CATEGORIES ===")
    for category, errors in error_categories.items():
        if errors:
            total_in_category = sum(count for _, count in errors)
            print(f"{category}: {total_in_category} instances ({len(errors)} unique errors)")
    
    return error_patterns, error_categories

# Analyze error patterns
if verification_system.error_patterns:
    error_analysis = analyze_error_patterns(verification_system)
else:
    print("No error patterns detected yet - run more verification cycles to collect data")
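
The keyword matching in `analyze_error_patterns` is first-match-wins: an error string lands in the first category whose keyword list matches, so the order of the branches matters. The standalone sketch below reproduces that rule on a few made-up error strings. The `categorize` helper and the sample strings are illustrative assumptions, not output from the verification system.

```python
from collections import Counter, defaultdict

# Keyword lists copied from analyze_error_patterns; dict order is the matching order
CATEGORY_KEYWORDS = {
    "Computational": ["calculation", "math", "arithmetic", "compute"],
    "Logical": ["logic", "reasoning", "contradiction"],
    "Factual": ["fact", "incorrect", "wrong"],
    "Structural": ["format", "structure", "syntax"],
}

def categorize(error: str) -> str:
    """Return the first category whose keywords appear in the error text."""
    error_lower = error.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in error_lower for word in keywords):
            return category
    return "Other"

# Hypothetical error strings, for illustration only
sample_errors = Counter({
    "Incorrect calculation of the 15% share": 3,
    "Reasoning skips the rounding step": 2,
    "Wrong unit in the final answer": 1,
    "Missing summary line": 1,
})

category_totals = defaultdict(int)
for error, count in sample_errors.items():
    category_totals[categorize(error)] += count

print(dict(category_totals))
```

The first sample contains both "incorrect" (a Factual keyword) and "calculation" (a Computational keyword); it is counted as Computational because that branch is checked first.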

## Verification Quality Metrics

Let's implement metrics to evaluate the quality of the verification process itself:

In [None]:
class VerificationQualityMetrics:
    """Metrics for evaluating verification system quality"""
    
    def __init__(self):
        self.metrics_history = []
    
    def calculate_consensus_stability(self, verification_results: List[VerificationResult]) -> float:
        """Calculate how stable the consensus is across experts"""
        
        if len(verification_results) < 2:
            return 1.0  # Perfect stability with single expert
        
        # Check agreement in correctness assessment
        correctness_votes = [vr.is_correct for vr in verification_results]
        agreement_rate = max(correctness_votes.count(True), correctness_votes.count(False)) / len(correctness_votes)
        
        # Weight by confidence levels
        confidence_weights = []
        for vr in verification_results:
            weight = {
                VerificationLevel.HIGH: 1.0,
                VerificationLevel.MEDIUM: 0.8,
                VerificationLevel.LOW: 0.6,
                VerificationLevel.UNCERTAIN: 0.2
            }[vr.confidence_level]
            confidence_weights.append(weight)
        
        avg_confidence = np.mean(confidence_weights)
        
        # Combine agreement rate with confidence
        stability_score = agreement_rate * avg_confidence
        
        return stability_score
    
    def calculate_error_detection_rate(self, verification_results: List[VerificationResult]) -> float:
        """Calculate rate of error detection across experts"""
        
        if not verification_results:
            return 0.0
        
        # Count how many experts detected errors
        experts_detecting_errors = sum(1 for vr in verification_results if vr.identified_errors)
        total_experts = len(verification_results)
        
        return experts_detecting_errors / total_experts
    
    def calculate_improvement_efficiency(self, 
                                       initial_score: float, 
                                       final_score: float, 
                                       total_rounds: int) -> float:
        """Calculate improvement per verification round"""
        
        if total_rounds == 0:
            return 0.0
        
        improvement = final_score - initial_score
        efficiency = improvement / total_rounds
        
        return max(0.0, efficiency)  # Ensure non-negative
    
    def calculate_expert_agreement_score(self, verification_results: List[VerificationResult]) -> float:
        """Calculate how much experts agree with each other"""
        
        if len(verification_results) < 2:
            return 1.0
        
        # Compare all pairs of experts
        agreements = []
        
        for i in range(len(verification_results)):
            for j in range(i + 1, len(verification_results)):
                vr1, vr2 = verification_results[i], verification_results[j]
                
                # Agreement on correctness
                correctness_agreement = 1.0 if vr1.is_correct == vr2.is_correct else 0.0
                
                # Confidence similarity
                conf_values = {
                    VerificationLevel.HIGH: 4,
                    VerificationLevel.MEDIUM: 3,
                    VerificationLevel.LOW: 2,
                    VerificationLevel.UNCERTAIN: 1
                }
                
                conf1 = conf_values[vr1.confidence_level]
                conf2 = conf_values[vr2.confidence_level]
                confidence_similarity = 1.0 - abs(conf1 - conf2) / 3.0
                
                # Combined agreement
                pair_agreement = (correctness_agreement + confidence_similarity) / 2.0
                agreements.append(pair_agreement)
        
        return np.mean(agreements) if agreements else 1.0
    
    def evaluate_verification_session(self, session_result: Dict[str, Any]) -> Dict[str, float]:
        """Evaluate complete verification session"""
        
        final_solution = session_result['final_solution']
        verification_results = final_solution.verification_results
        
        metrics = {
            'consensus_stability': self.calculate_consensus_stability(verification_results),
            'error_detection_rate': self.calculate_error_detection_rate(verification_results),
            'improvement_efficiency': self.calculate_improvement_efficiency(
                0.3,  # Assumed initial score
                final_solution.consensus_score,
                session_result['total_verification_rounds'] + session_result['total_refinement_iterations']
            ),
            'expert_agreement': self.calculate_expert_agreement_score(verification_results),
            'final_consensus_score': final_solution.consensus_score,
            'confidence_level_numeric': {
                VerificationLevel.HIGH: 1.0,
                VerificationLevel.MEDIUM: 0.75,
                VerificationLevel.LOW: 0.5,
                VerificationLevel.UNCERTAIN: 0.25
            }[final_solution.final_confidence]
        }
        
        # Overall quality score (weighted combination)
        weights = {
            'consensus_stability': 0.25,
            'error_detection_rate': 0.20,
            'improvement_efficiency': 0.15,
            'expert_agreement': 0.20,
            'final_consensus_score': 0.20
        }
        
        overall_quality = sum(metrics[key] * weights[key] for key in weights.keys())
        metrics['overall_quality'] = overall_quality
        
        self.metrics_history.append(metrics)
        
        return metrics
    
    def visualize_quality_metrics(self, session_metrics: List[Dict[str, float]]):
        """Visualize verification quality metrics"""
        
        if not session_metrics:
            print("No metrics to visualize")
            return
        
        # Extract metrics for visualization
        metric_names = ['consensus_stability', 'error_detection_rate', 'improvement_efficiency', 
                       'expert_agreement', 'final_consensus_score', 'overall_quality']
        
        metric_values = {name: [session[name] for session in session_metrics] for name in metric_names}
        
        # Create a grouped bar chart (per-session metrics) and a line chart (overall quality)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # 1. Quality metrics comparison
        x_pos = range(len(session_metrics))
        width = 0.15
        
        for i, metric in enumerate(metric_names[:-1]):  # Exclude overall_quality
            ax1.bar([x + i * width for x in x_pos], metric_values[metric], 
                   width, label=metric.replace('_', ' ').title(), alpha=0.8)
        
        ax1.set_xlabel('Session')
        ax1.set_ylabel('Score')
        ax1.set_title('Verification Quality Metrics by Session')
        ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
        ax1.set_ylim(0, 1.1)
        
        # 2. Overall quality trend
        ax2.plot(x_pos, metric_values['overall_quality'], 'o-', linewidth=2, markersize=8, color='purple')
        ax2.fill_between(x_pos, metric_values['overall_quality'], alpha=0.3, color='purple')
        ax2.set_xlabel('Session')
        ax2.set_ylabel('Overall Quality Score')
        ax2.set_title('Overall Verification Quality Trend')
        ax2.grid(True, alpha=0.3)
        ax2.set_ylim(0, 1.1)
        
        plt.tight_layout()
        plt.show()
        
        # Print summary statistics
        print("\n=== VERIFICATION QUALITY SUMMARY ===")
        for metric in metric_names:
            values = metric_values[metric]
            print(f"{metric.replace('_', ' ').title():.<25} Avg: {np.mean(values):.3f} ± {np.std(values):.3f}")

# Initialize quality metrics
quality_metrics = VerificationQualityMetrics()

# Evaluate our test sessions
if test_results:
    session_quality_metrics = []
    
    for session in test_results:
        metrics = quality_metrics.evaluate_verification_session(session)
        session_quality_metrics.append(metrics)
        
        print(f"\nSession Quality Metrics:")
        for metric_name, value in metrics.items():
            print(f"  {metric_name}: {value:.3f}")
    
    # Visualize quality metrics
    quality_metrics.visualize_quality_metrics(session_quality_metrics)
else:
    print("No test results available for quality evaluation")
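
To make the two agreement metrics concrete, here is a small worked example that recomputes `calculate_consensus_stability` and `calculate_expert_agreement_score` by hand for three experts. `Level` and `Vote` are stand-ins introduced only for this sketch (they mirror the two fields of `VerificationResult` that the metrics read); the weights and formulas are copied from the class above.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import combinations
from statistics import mean

class Level(Enum):  # stand-in for VerificationLevel
    HIGH = 4
    MEDIUM = 3
    LOW = 2
    UNCERTAIN = 1

@dataclass
class Vote:  # stand-in for VerificationResult, reduced to the fields the metrics use
    is_correct: bool
    confidence_level: Level

# Weights copied from calculate_consensus_stability
STABILITY_WEIGHTS = {Level.HIGH: 1.0, Level.MEDIUM: 0.8, Level.LOW: 0.6, Level.UNCERTAIN: 0.2}

def consensus_stability(votes):
    """Majority share of the correctness votes, scaled by the mean confidence weight."""
    verdicts = [v.is_correct for v in votes]
    agreement_rate = max(verdicts.count(True), verdicts.count(False)) / len(verdicts)
    return agreement_rate * mean(STABILITY_WEIGHTS[v.confidence_level] for v in votes)

def expert_agreement(votes):
    """Mean over all expert pairs of (verdict match + confidence similarity) / 2."""
    pair_scores = []
    for a, b in combinations(votes, 2):
        same_verdict = 1.0 if a.is_correct == b.is_correct else 0.0
        similarity = 1.0 - abs(a.confidence_level.value - b.confidence_level.value) / 3.0
        pair_scores.append((same_verdict + similarity) / 2.0)
    return mean(pair_scores)

votes = [Vote(True, Level.HIGH), Vote(True, Level.MEDIUM), Vote(False, Level.LOW)]
print(f"consensus stability: {consensus_stability(votes):.3f}")
print(f"expert agreement:    {expert_agreement(votes):.3f}")
```

Two of three experts agree (2/3) and the mean confidence weight is 0.8, so stability is about 0.533. The three pairwise scores are roughly 0.833, 0.167, and 0.333, which average to about 0.444.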

## Production Implementation Template

Here's a streamlined template for production use:

In [None]:
class ProductionVerificationSystem:
    """Production-ready verification and error correction system"""
    
    def __init__(self, llm, config: Optional[Dict[str, Any]] = None):
        self.llm = llm
        
        # Default configuration
        default_config = {
            'max_verification_rounds': 3,
            'max_refinement_iterations': 2,
            'quality_threshold': 0.75,
            'consensus_threshold': 0.7,
            'enable_early_stopping': True,
            'require_high_confidence': False
        }
        
        self.config = {**default_config, **(config or {})}
        self.verification_cache = {}  # Reserved for caching results; not yet read or written by the methods below
    
    def verify_solution(self, 
                       problem: str, 
                       solution: str,
                       domain: str = "general") -> Dict[str, Any]:
        """Main verification entry point"""
        
        # Select appropriate verification experts based on domain
        domain_experts = {
            'mathematical': ['Expert Mathematician', 'Expert Problem Solver'],
            'programming': ['Expert Code Reviewer', 'Expert Problem Solver'],
            'chess': ['Expert Chess Analyst', 'Expert Chess Player'],
            'creative': ['Expert Fact Checker', 'Expert Problem Solver'],
            'general': ['Expert Problem Solver', 'Expert Fact Checker']
        }
        
        verification_experts = domain_experts.get(domain, domain_experts['general'])
        
        # Run verification process
        candidate = SolutionCandidate(
            solution=solution,
            generator_expert="Initial",
            generation_round=1
        )
        
        verification_results = []
        
        for i, expert in enumerate(verification_experts[:self.config['max_verification_rounds']]):
            # Generate verification instruction
            instruction = self._create_verification_instruction(problem, solution, expert)
            
            # Get expert verification
            response = self.llm.invoke([HumanMessage(content=instruction)])
            
            # Parse result
            verification_result = self._parse_verification_response(response.content, expert, i + 1)
            verification_results.append(verification_result)
            
            # Early stopping if high confidence consensus
            if (self.config['enable_early_stopping'] and i >= 1 and
                self._check_early_stopping_condition(verification_results)):
                break
        
        # Calculate final assessment
        consensus_score = self._calculate_consensus_score(verification_results)
        final_confidence = self._determine_confidence_level(consensus_score, verification_results)
        
        # Determine if refinement needed
        needs_refinement = (consensus_score < self.config['quality_threshold'] or
                          any(vr.identified_errors for vr in verification_results))
        
        return {
            'verified': consensus_score >= self.config['consensus_threshold'],
            'consensus_score': consensus_score,
            'confidence_level': final_confidence,
            'needs_refinement': needs_refinement,
            'verification_results': verification_results,
            'recommendations': self._generate_recommendations(verification_results)
        }
    
    def refine_solution(self, 
                       problem: str, 
                       solution: str, 
                       verification_feedback: Dict[str, Any]) -> str:
        """Refine solution based on verification feedback"""
        
        # Extract key issues and suggestions
        all_errors = []
        all_suggestions = []
        
        for vr in verification_feedback['verification_results']:
            all_errors.extend(vr.identified_errors)
            all_suggestions.extend(vr.suggested_corrections)
        
        # Create refinement instruction
        refinement_instruction = f"""
You are an Expert Problem Solver tasked with refining a solution based on expert feedback.

ORIGINAL PROBLEM:
{problem}

CURRENT SOLUTION:
{solution}

ISSUES TO ADDRESS:
""" + "\n".join([f"- {error}" for error in all_errors[:5]]) + """

IMPROVEMENT SUGGESTIONS:
""" + "\n".join([f"- {suggestion}" for suggestion in all_suggestions[:5]]) + """

Provide a refined solution that addresses these issues while maintaining the strengths of the original solution.
        """
        
        response = self.llm.invoke([HumanMessage(content=refinement_instruction)])
        return response.content
    
    def verify_and_refine(self, 
                         problem: str, 
                         initial_solution: str,
                         domain: str = "general") -> Dict[str, Any]:
        """Complete verification and refinement pipeline"""
        
        current_solution = initial_solution
        iteration_history = []
        
        for iteration in range(self.config['max_refinement_iterations'] + 1):
            # Verify current solution
            verification_result = self.verify_solution(problem, current_solution, domain)
            
            iteration_data = {
                'iteration': iteration,
                'solution': current_solution,
                'verification': verification_result
            }
            iteration_history.append(iteration_data)
            
            # Check if solution meets quality standards
            if (verification_result['verified'] and 
                not verification_result['needs_refinement']):
                break
            
            # Refine if not at max iterations
            if iteration < self.config['max_refinement_iterations']:
                current_solution = self.refine_solution(problem, current_solution, verification_result)
        
        return {
            'final_solution': current_solution,
            'final_verification': iteration_history[-1]['verification'],
            'iteration_history': iteration_history,
            'total_iterations': len(iteration_history),
            'quality_achieved': iteration_history[-1]['verification']['verified']
        }
    
    def _create_verification_instruction(self, problem: str, solution: str, expert: str) -> str:
        """Create verification instruction for expert"""
        return f"""
You are {expert}, verifying the accuracy of a solution.

PROBLEM: {problem}

SOLUTION TO VERIFY: {solution}

Respond with:
VERIFICATION: [CORRECT/INCORRECT/PARTIALLY_CORRECT]
CONFIDENCE: [HIGH/MEDIUM/LOW]
REASONING: [Your analysis]
ERRORS_FOUND: [List errors or "None"]
SUGGESTIONS: [Improvements or "None"]
        """
    
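    def _parse_verification_response(self, response: str, expert: str, round_num: int) -> VerificationResult:
        """Parse verification response (simplified)"""
        # This is a simplified version - use the full parser from above for production.
        # It reads the labelled fields instead of searching the whole response: a plain
        # substring test would count PARTIALLY_CORRECT as correct and would match HIGH or
        # LOW inside words such as "highlight" or "follow" in the reasoning text.
        # ERRORS_FOUND and SUGGESTIONS are not parsed here.
        verdict_match = re.search(r"VERIFICATION:\W*(\w+)", response, re.IGNORECASE)
        confidence_match = re.search(r"CONFIDENCE:\W*(\w+)", response, re.IGNORECASE)
        
        # Only an unqualified CORRECT verdict counts as correct
        is_correct = bool(verdict_match) and verdict_match.group(1).upper() == "CORRECT"
        
        confidence_by_label = {"HIGH": VerificationLevel.HIGH, "LOW": VerificationLevel.LOW}
        label = confidence_match.group(1).upper() if confidence_match else ""
        confidence = confidence_by_label.get(label, VerificationLevel.MEDIUM)  # Default
        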
        return VerificationResult(
            is_correct=is_correct,
            confidence_level=confidence,
            expert_name=expert,
            verification_reasoning=response,
            verification_round=round_num
        )
    
    def _calculate_consensus_score(self, verification_results: List[VerificationResult]) -> float:
        """Calculate consensus score (simplified)"""
        if not verification_results:
            return 0.0
        
        correct_count = sum(1 for vr in verification_results if vr.is_correct)
        return correct_count / len(verification_results)
    
    def _determine_confidence_level(self, consensus_score: float, verification_results: List[VerificationResult]) -> VerificationLevel:
        """Determine overall confidence level"""
        if consensus_score >= 0.8:
            return VerificationLevel.HIGH
        elif consensus_score >= 0.6:
            return VerificationLevel.MEDIUM
        else:
            return VerificationLevel.LOW
    
    def _check_early_stopping_condition(self, verification_results: List[VerificationResult]) -> bool:
        """Check if early stopping conditions are met"""
        if len(verification_results) < 2:
            return False
        
        # Check if last two experts agree with high confidence
        last_two = verification_results[-2:]
        return (all(vr.confidence_level == VerificationLevel.HIGH for vr in last_two) and
                all(vr.is_correct == last_two[0].is_correct for vr in last_two))
    
    def _generate_recommendations(self, verification_results: List[VerificationResult]) -> List[str]:
        """Generate actionable recommendations"""
        # Collect all suggestions
        all_suggestions = []
        for vr in verification_results:
            all_suggestions.extend(vr.suggested_corrections)
        
        # Return the first five unique suggestions; dict.fromkeys keeps first-seen order,
        # whereas set() would make the selection order arbitrary between runs
        unique_suggestions = list(dict.fromkeys(all_suggestions))
        return unique_suggestions[:5]

# Initialize production system
production_config = {
    'max_verification_rounds': 2,
    'max_refinement_iterations': 1,
    'quality_threshold': 0.8,
    'consensus_threshold': 0.7,
    'enable_early_stopping': True
}

production_verification = ProductionVerificationSystem(llm, production_config)

print("\n🚀 PRODUCTION VERIFICATION SYSTEM READY")
print("\n✅ Key Features:")
print("  • Streamlined verification with domain-specific experts")
print("  • Automated refinement based on feedback")
print("  • Early stopping for efficiency")
print("  • Configurable quality thresholds")
print("  • Complete pipeline from initial solution to verified result")
print("\n📊 Configuration:")
for key, value in production_config.items():
    print(f"  • {key}: {value}")
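
Before running the pipeline it helps to know how many LLM calls a configuration can trigger. The helper below is a rough sketch derived from reading `verify_solution` and `verify_and_refine` above; `worst_case_llm_calls` is a hypothetical name, not part of the template.

```python
def worst_case_llm_calls(num_experts: int,
                         max_verification_rounds: int,
                         max_refinement_iterations: int) -> int:
    """Upper bound on llm.invoke calls for one verify_and_refine run."""
    # verify_solution consults at most this many experts per verification pass
    calls_per_pass = min(num_experts, max_verification_rounds)
    # verify_and_refine runs one verification pass per iteration and makes one
    # refinement call between consecutive passes
    verification_passes = max_refinement_iterations + 1
    return verification_passes * calls_per_pass + max_refinement_iterations

# production_config above: 2 rounds, 1 refinement, 2 experts listed per domain
print(worst_case_llm_calls(num_experts=2, max_verification_rounds=2, max_refinement_iterations=1))

# class defaults: 3 rounds, 2 refinements, still 2 experts listed per domain
print(worst_case_llm_calls(num_experts=2, max_verification_rounds=3, max_refinement_iterations=2))
```

This gives 5 calls for `production_config` and 8 for the class defaults. Note that every domain lists two experts and the early-stopping check only runs once two results exist, so early stopping cannot reduce the call count unless a domain lists three or more experts.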

## Test Production System

In [None]:
# Test the production verification system
test_problem = """
A company has 120 employees. 40% work in sales, 25% in engineering, 20% in marketing, 
and the rest in administration. If the company plans to hire 30 more employees with 
the same proportional distribution, how many employees will work in each department 
after the expansion?
"""

test_solution = """
Current distribution:
- Sales: 40% of 120 = 48 employees
- Engineering: 25% of 120 = 30 employees  
- Marketing: 20% of 120 = 24 employees
- Administration: 15% of 120 = 18 employees

New hires (30 employees):
- Sales: 40% of 30 = 12 employees
- Engineering: 25% of 30 = 7.5 ≈ 8 employees
- Marketing: 20% of 30 = 6 employees
- Administration: 15% of 30 = 4.5 ≈ 4 employees

Final distribution:
- Sales: 48 + 12 = 60 employees
- Engineering: 30 + 8 = 38 employees
- Marketing: 24 + 6 = 30 employees
- Administration: 18 + 4 = 22 employees
Total: 150 employees
"""

print("Testing Production Verification System...")
production_result = production_verification.verify_and_refine(
    problem=test_problem.strip(),
    initial_solution=test_solution.strip(),
    domain="mathematical"
)

print(f"\n=== PRODUCTION SYSTEM RESULTS ===")
print(f"Quality Achieved: {production_result['quality_achieved']}")
print(f"Total Iterations: {production_result['total_iterations']}")
print(f"Final Consensus Score: {production_result['final_verification']['consensus_score']:.2f}")
print(f"Final Confidence: {production_result['final_verification']['confidence_level'].value}")

if production_result['final_verification']['recommendations']:
    print(f"\nRecommendations:")
    for i, rec in enumerate(production_result['final_verification']['recommendations'][:3], 1):
        print(f"  {i}. {rec}")

print(f"\nFinal Solution Preview: {production_result['final_solution'][:300]}...")
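
The sample solution is a useful test input because the proportions do not divide evenly. As a quick independent check (plain arithmetic, not part of the verification system), the exact proportional headcounts for the expanded company can be computed with `fractions`:

```python
from fractions import Fraction

shares = {
    "Sales": Fraction(40, 100),
    "Engineering": Fraction(25, 100),
    "Marketing": Fraction(20, 100),
    "Administration": Fraction(15, 100),  # "the rest": 100% - 40% - 25% - 20%
}
assert sum(shares.values()) == 1

total_after_expansion = 120 + 30
exact_headcount = {dept: share * total_after_expansion for dept, share in shares.items()}

for dept, count in exact_headcount.items():
    note = "" if count.denominator == 1 else "  (fractional, needs a rounding rule)"
    print(f"{dept}: {float(count)}{note}")
```

Engineering (37.5) and Administration (22.5) come out fractional, so any whole-number answer depends on a rounding convention. The sample solution rounds 7.5 up and 4.5 down, which keeps the total at 150 but is only one possible choice; this is the kind of unstated assumption a verification expert may flag.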

## Key Takeaways

### 🎯 **Multi-Round Verification Principles**
1. **Systematic Dual Verification**: Always use multiple experts for independent validation
2. **Fresh Eyes for Each Round**: Each verification expert sees only the problem and solution
3. **Iterative Refinement**: Solutions improve through feedback-driven refinement cycles
4. **Consensus-Based Assessment**: Aggregate multiple expert opinions for robust evaluation

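### 📊 **Paper Insights Illustrated**
- **Wisdom of the Crowd**: The paper argues that the collective opinion of diverse experts often surpasses a single expert's assessment
- **Error Detection**: Giving each verifier only the problem and the candidate solution ("fresh eyes") is meant to catch errors the original solver would repeat
- **Systematic Verification**: Consulting an expert for validation before finalizing a response is the paper's core reliability mechanism
- **Quality Improvement**: The score gains reported in this notebook are relative to an assumed 0.3 baseline, so treat them as illustrative rather than measured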

### 🔧 **Implementation Best Practices**
1. **Expert Selection**: Use domain-appropriate experts for verification
2. **Confidence Weighting**: Weight expert opinions by their confidence levels
3. **Early Stopping**: Implement efficient stopping criteria for high-confidence consensus
4. **Error Categorization**: Track and categorize common error patterns
5. **Quality Metrics**: Monitor verification system effectiveness over time
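
The production template's `_calculate_consensus_score` counts every vote equally. One possible way to apply confidence weighting (item 2 above) is sketched below. `Level` is a stand-in for `VerificationLevel`, `weighted_consensus` is a hypothetical helper, and the weights are borrowed from `calculate_consensus_stability`; none of this is the notebook's own scoring rule.

```python
from enum import Enum
from typing import List, Tuple

class Level(Enum):  # stand-in for VerificationLevel
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    UNCERTAIN = "uncertain"

# Assumed weights, borrowed from calculate_consensus_stability
CONFIDENCE_WEIGHTS = {Level.HIGH: 1.0, Level.MEDIUM: 0.8, Level.LOW: 0.6, Level.UNCERTAIN: 0.2}

def weighted_consensus(votes: List[Tuple[bool, Level]]) -> float:
    """Share of the total confidence weight held by experts who judged the solution correct."""
    total_weight = sum(CONFIDENCE_WEIGHTS[level] for _, level in votes)
    if total_weight == 0:
        return 0.0  # no votes
    correct_weight = sum(CONFIDENCE_WEIGHTS[level] for is_correct, level in votes if is_correct)
    return correct_weight / total_weight

votes = [(True, Level.HIGH), (True, Level.MEDIUM), (False, Level.LOW)]
print(f"unweighted: {sum(ok for ok, _ in votes) / len(votes):.2f}")
print(f"weighted:   {weighted_consensus(votes):.2f}")
```

With these votes the unweighted score is 0.67 and the weighted score is 0.75, because the dissenting expert reported low confidence.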

### ⚖️ **Trade-offs**
- **Quality vs. Speed**: More verification rounds improve quality but increase latency
- **Consensus vs. Individual Excellence**: Group consensus may overlook brilliant individual insights
- **Coverage vs. Depth**: Broader expert coverage vs. deeper domain-specific verification

### 🚀 **Production Recommendations**
1. **Start with 2-3 verification experts** for most problems
2. **Implement early stopping** when high confidence consensus is reached
3. **Use domain-specific expert selection** for better verification quality
4. **Track quality metrics** to optimize verification parameters
5. **Cache verification patterns** for similar problems to improve efficiency
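
The template declares `verification_cache` but does not use it. A minimal sketch of exact-match caching follows, assuming results can be keyed on the whitespace-normalised problem, solution, and domain. `cache_key`, `cached_verify`, and `fake_verify` are hypothetical names; matching merely *similar* problems would need a similarity measure that is not shown here.

```python
import hashlib
from typing import Any, Callable, Dict

def cache_key(problem: str, solution: str, domain: str) -> str:
    """Hash the whitespace-normalised inputs so trivially different copies share a key."""
    normalised = "\x1f".join(" ".join(part.split()) for part in (problem, solution, domain))
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def cached_verify(verify: Callable[[str, str, str], Dict[str, Any]],
                  cache: Dict[str, Dict[str, Any]],
                  problem: str, solution: str, domain: str = "general") -> Dict[str, Any]:
    """Call verify only when this (problem, solution, domain) has not been seen before."""
    key = cache_key(problem, solution, domain)
    if key not in cache:
        cache[key] = verify(problem, solution, domain)
    return cache[key]

# Stand-in for ProductionVerificationSystem.verify_solution, so the sketch runs without an LLM
calls = []
def fake_verify(problem: str, solution: str, domain: str) -> Dict[str, Any]:
    calls.append(problem)
    return {"verified": True, "consensus_score": 1.0}

cache: Dict[str, Dict[str, Any]] = {}
cached_verify(fake_verify, cache, "What is 2 + 2?", "4", "mathematical")
cached_verify(fake_verify, cache, "What is  2 + 2? ", "4", "mathematical")  # extra whitespace only
print(len(calls))
```

The second call is served from the cache, so `fake_verify` runs once. Because LLM verdicts can vary between runs, a cache like this pins whichever verdict was sampled first; whether that is acceptable depends on the application.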

### 📈 **Quality Indicators**
- **Consensus Score**: >0.8 indicates high agreement among experts
- **Confidence Level**: HIGH confidence with consensus suggests reliable solution
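- **Error Detection Rate**: Fraction of experts that flagged at least one error; a high rate can reflect thorough verification or a weak candidate solution, so read it alongside the consensus score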
- **Expert Agreement**: High agreement suggests stable, reliable assessment

Multi-Round Verification and Error Correction is the quality assurance engine of meta-prompting, ensuring solutions are accurate, reliable, and continuously improved through systematic expert validation and iterative refinement.