# Output Refinement: Best-of-N and Refine

This tutorial builds custom DSPy modules for two output refinement techniques, Best-of-N sampling and iterative refinement, and shows how each can improve the quality of generated responses.

## What is Output Refinement?

Output refinement spends extra LM calls to improve a response. It either generates several candidate outputs and selects the best one (Best-of-N), or iteratively improves a single output using feedback (Refine). Both techniques aim to improve response quality and consistency.

## Key Techniques:
- **Best-of-N**: Generate N outputs and select the best one
- **Refine**: Iteratively improve an output through multiple passes
- **Hybrid approaches**: Run Best-of-N first, then refine the winning candidate

In [None]:
# Install required packages
import sys
import subprocess

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

try:
    import dspy
except ImportError:
    install_package("dspy")
    import dspy

import os  # used to read OPENAI_API_KEY from the environment

## Setup and Configuration

In [None]:
# Configure DSPy
lm = dspy.LM('openai/gpt-4o-mini', api_key=os.getenv('OPENAI_API_KEY'))
dspy.configure(lm=lm)

print("DSPy configured successfully!")

## Best-of-N Sampling

The Best-of-N technique generates multiple candidate outputs and selects the best one based on a quality metric.
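
Stripped of the LM calls, the selection step is a plain arg-max over scored candidates. The following is a minimal sketch under stated assumptions: `generate` and `score` are hypothetical stand-ins for the generator and evaluator modules, and the canned strings and word-count metric exist only to make the example runnable without an API key.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    n: int,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Draw n candidates, score each one, and return the highest-scoring."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(i) for i in range(n)]
    scored = [(text, score(text)) for text in candidates]
    best_text, best_score = max(scored, key=lambda pair: pair[1])
    return best_text, best_score, scored


# Hypothetical stand-ins: a canned "generator" and a word-count "metric".
canned = [
    "Solar is cheap.",
    "Solar and wind cut emissions and fuel costs.",
    "Wind.",
]
best_text, best_score, scored = best_of_n(
    generate=lambda i: canned[i],
    score=lambda text: min(len(text.split()) / 10, 1.0),
    n=3,
)
print(best_text)
```

In the DSPy module, the same roles are played by a generator module and an LM-based evaluator, so each candidate costs one generation call and one evaluation call.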

In [None]:
class ResponseGenerationSignature(dspy.Signature):
    """Generate a high-quality response to a question."""
    
    question: str = dspy.InputField(desc="The question to answer")
    response: str = dspy.OutputField(desc="A comprehensive and accurate response")

class ResponseEvaluationSignature(dspy.Signature):
    """Evaluate the quality of a response."""
    
    question: str = dspy.InputField(desc="The original question")
    response: str = dspy.InputField(desc="The response to evaluate")
    score: float = dspy.OutputField(desc="Quality score from 0.0 to 1.0")
    reasoning: str = dspy.OutputField(desc="Explanation for the score")

class BestOfNGenerator(dspy.Module):
    """Generate N responses and select the best one."""
    
    def __init__(self, n_candidates: int = 3):
        super().__init__()
        self.n_candidates = n_candidates
        self.generator = dspy.ChainOfThought(ResponseGenerationSignature)
        self.evaluator = dspy.ChainOfThought(ResponseEvaluationSignature)
    
    def forward(self, question: str) -> dspy.Prediction:
        # Generate N candidate responses
        candidates = []
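        for i in range(self.n_candidates):
            try:
                # DSPy caches LM calls by default, so identical requests would
                # return the same text N times. Varying the temperature per call
                # (an illustrative choice) makes each request distinct.
                response = self.generator(
                    question=question,
                    config={"temperature": 0.7 + 0.1 * i}
                )
                candidates.append(response.response)
            except Exception as e:
                print(f"Error generating candidate {i}: {e}")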
        
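        if not candidates:
            # Return every field that callers read, so downstream code such as
            # result.best_score does not raise AttributeError on failure.
            return dspy.Prediction(
                best_response="Error: No candidates generated",
                best_score=0.0,
                all_candidates=[],
                selection_reasoning="No candidates were generated"
            )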
        
        # Evaluate each candidate
        evaluations = []
        for candidate in candidates:
            try:
                eval_result = self.evaluator(question=question, response=candidate)
                evaluations.append({
                    'response': candidate,
                    'score': float(eval_result.score),
                    'reasoning': eval_result.reasoning
                })
            except Exception as e:
                evaluations.append({
                    'response': candidate,
                    'score': 0.0,
                    'reasoning': f"Evaluation error: {e}"
                })
        
        # Select the best candidate
        best_candidate = max(evaluations, key=lambda x: x['score'])
        
        return dspy.Prediction(
            best_response=best_candidate['response'],
            best_score=best_candidate['score'],
            all_candidates=evaluations,
            selection_reasoning=best_candidate['reasoning']
        )

# Test Best-of-N
best_of_n = BestOfNGenerator(n_candidates=3)

test_question = "What are the key benefits of renewable energy sources?"
result = best_of_n(question=test_question)

print(f"Question: {test_question}")
print(f"\nBest Response (Score: {result.best_score}):")
print(result.best_response)
print(f"\nSelection Reasoning: {result.selection_reasoning}")
print(f"\nTotal candidates evaluated: {len(result.all_candidates)}")

## Refine Technique

The Refine technique iteratively improves an initial response through multiple refinement passes.
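
The control flow of a refinement loop can be shown without any LM calls. This is a minimal sketch, assuming two hypothetical callables: `critique` (stands in for feedback generation) and `revise` (stands in for the rewrite step). The toy implementations are invented purely so the loop can run locally.

```python
from typing import Callable, Dict, List


def refine(
    draft: str,
    critique: Callable[[str], str],
    revise: Callable[[str, str], str],
    max_iterations: int,
) -> Dict[str, object]:
    """Alternate critique and revision, recording every intermediate draft."""
    history: List[Dict[str, object]] = [
        {"iteration": 0, "response": draft, "feedback": None}
    ]
    current = draft
    for iteration in range(1, max_iterations + 1):
        feedback = critique(current)
        current = revise(current, feedback)
        history.append(
            {"iteration": iteration, "response": current, "feedback": feedback}
        )
    return {"final": current, "history": history}


# Hypothetical stand-ins for the two LM calls.
def toy_critique(text: str) -> str:
    return "add an example" if "e.g." not in text else "tighten the wording"


def toy_revise(text: str, feedback: str) -> str:
    if feedback == "add an example":
        return text + " e.g. spam filtering."
    return text.replace("basically ", "")


refine_result = refine(
    "ML is basically learning from data.", toy_critique, toy_revise, max_iterations=2
)
print(refine_result["final"])
```

Keeping the full history, as the sketch does, makes it possible to inspect whether later passes actually improved on earlier drafts.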

In [None]:
class RefinementSignature(dspy.Signature):
    """Refine and improve a response based on feedback."""
    
    question: str = dspy.InputField(desc="The original question")
    current_response: str = dspy.InputField(desc="The current response to improve")
    feedback: str = dspy.InputField(desc="Specific feedback for improvement")
    refined_response: str = dspy.OutputField(desc="The improved response")

class FeedbackGenerationSignature(dspy.Signature):
    """Generate specific feedback for improving a response."""
    
    question: str = dspy.InputField(desc="The original question")
    response: str = dspy.InputField(desc="The response to provide feedback on")
    feedback: str = dspy.OutputField(desc="Specific, actionable feedback for improvement")

class IterativeRefiner(dspy.Module):
    """Iteratively refine responses through multiple passes."""
    
    def __init__(self, max_iterations: int = 3):
        super().__init__()
        self.max_iterations = max_iterations
        self.generator = dspy.ChainOfThought(ResponseGenerationSignature)
        self.feedback_generator = dspy.ChainOfThought(FeedbackGenerationSignature)
        self.refiner = dspy.ChainOfThought(RefinementSignature)
        self.evaluator = dspy.ChainOfThought(ResponseEvaluationSignature)
    
    def forward(self, question: str) -> dspy.Prediction:
        # Generate initial response
        initial_response = self.generator(question=question)
        current_response = initial_response.response
        
        refinement_history = [
            {
                'iteration': 0,
                'response': current_response,
                'feedback': 'Initial generation'
            }
        ]
        
        # Iterative refinement
        for iteration in range(1, self.max_iterations + 1):
            # Generate feedback
            feedback_result = self.feedback_generator(
                question=question,
                response=current_response
            )
            
            # Refine response based on feedback
            refinement_result = self.refiner(
                question=question,
                current_response=current_response,
                feedback=feedback_result.feedback
            )
            
            current_response = refinement_result.refined_response
            
            refinement_history.append({
                'iteration': iteration,
                'response': current_response,
                'feedback': feedback_result.feedback
            })
        
        # Evaluate final response
        final_evaluation = self.evaluator(
            question=question,
            response=current_response
        )
        
        return dspy.Prediction(
            final_response=current_response,
            final_score=float(final_evaluation.score),
            evaluation_reasoning=final_evaluation.reasoning,
            refinement_history=refinement_history
        )

# Test Iterative Refinement
refiner = IterativeRefiner(max_iterations=2)

test_question = "Explain the concept of machine learning and its applications."
refined_result = refiner(question=test_question)

print(f"Question: {test_question}")
print(f"\nFinal Response (Score: {refined_result.final_score}):")
print(refined_result.final_response)
print(f"\nEvaluation: {refined_result.evaluation_reasoning}")

print("\nRefinement History:")
for step in refined_result.refinement_history:
    print(f"\nIteration {step['iteration']}:")
    print(f"Feedback: {step['feedback']}")
    print(f"Response: {step['response'][:100]}...")

## Hybrid Approach: Best-of-N + Refine

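The hybrid approach first selects the best of N candidates, then runs feedback-driven refinement on the winner. The difference between the final score and the initial best score shows how much the refinement passes added; it can be negative if refinement made the response worse.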
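
Combining the two techniques also combines their cost. The sketch below estimates the LM-call budget, assuming every module call maps to exactly one LM request with no cache hits, retries, or failed candidates. The function names are hypothetical helpers, not part of DSPy; the counts are read off the `BestOfNGenerator` and `IterativeRefiner` classes defined earlier and the hybrid module in this section.

```python
def best_of_n_calls(n_candidates: int) -> int:
    """N generation calls plus N evaluation calls."""
    return 2 * n_candidates


def iterative_refine_calls(max_iterations: int) -> int:
    """One initial draft, one feedback + one rewrite per pass, one final evaluation."""
    return 1 + 2 * max_iterations + 1


def hybrid_calls(n_candidates: int, max_refinements: int) -> int:
    """Best-of-N, then feedback + rewrite per pass, then one final evaluation."""
    return best_of_n_calls(n_candidates) + 2 * max_refinements + 1


print(best_of_n_calls(3), iterative_refine_calls(2), hybrid_calls(3, 2))  # prints: 6 6 11
```

With the settings used in this tutorial (three candidates, two refinement passes), the hybrid pipeline costs roughly twice as many calls as either technique alone.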

In [None]:
class HybridRefinementSystem(dspy.Module):
    """Combine Best-of-N sampling with iterative refinement."""
    
    def __init__(self, n_candidates: int = 3, max_refinements: int = 2):
        super().__init__()
        self.n_candidates = n_candidates
        self.max_refinements = max_refinements
        self.best_of_n = BestOfNGenerator(n_candidates)
        self.refiner = IterativeRefiner(max_refinements)
        self.evaluator = dspy.ChainOfThought(ResponseEvaluationSignature)
    
    def forward(self, question: str) -> dspy.Prediction:
        # Step 1: Generate best initial response using Best-of-N
        best_initial = self.best_of_n(question=question)
        
        # Step 2: Refine the best initial response
        # Reuse the feedback and refinement submodules already registered on
        # self.refiner instead of re-creating modules on every call, so they
        # remain visible to DSPy optimizers.
        feedback_generator = self.refiner.feedback_generator
        refiner_module = self.refiner.refiner
        
        current_response = best_initial.best_response
        refinement_steps = []
        
        for iteration in range(self.max_refinements):
            # Generate feedback
            feedback_result = feedback_generator(
                question=question,
                response=current_response
            )
            
            # Refine response
            refinement_result = refiner_module(
                question=question,
                current_response=current_response,
                feedback=feedback_result.feedback
            )
            
            current_response = refinement_result.refined_response
            
            refinement_steps.append({
                'iteration': iteration + 1,
                'feedback': feedback_result.feedback,
                'response': current_response
            })
        
        # Final evaluation
        final_evaluation = self.evaluator(
            question=question,
            response=current_response
        )
        
        return dspy.Prediction(
            final_response=current_response,
            final_score=float(final_evaluation.score),
            initial_best_score=best_initial.best_score,
            improvement=float(final_evaluation.score) - best_initial.best_score,
            initial_candidates=best_initial.all_candidates,
            refinement_steps=refinement_steps,
            evaluation_reasoning=final_evaluation.reasoning
        )

# Test Hybrid System
hybrid_system = HybridRefinementSystem(n_candidates=3, max_refinements=2)

test_question = "What are the ethical implications of artificial intelligence in healthcare?"
hybrid_result = hybrid_system(question=test_question)

print(f"Question: {test_question}")
print(f"\nInitial Best Score: {hybrid_result.initial_best_score}")
print(f"Final Score: {hybrid_result.final_score}")
print(f"Improvement: {hybrid_result.improvement:.3f}")
print(f"\nFinal Response:")
print(hybrid_result.final_response)
print(f"\nEvaluation: {hybrid_result.evaluation_reasoning}")

print(f"\nInitial candidates considered: {len(hybrid_result.initial_candidates)}")
print(f"Refinement steps: {len(hybrid_result.refinement_steps)}")

## Advanced Refinement Strategies

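This section scores a response on accuracy, clarity, and completeness, then repeatedly refines whichever criterion scores lowest. A revision is kept only if it raises the overall score by more than `improvement_threshold`. Refinement stops once every criterion exceeds 0.8, when a revision fails that test, or after three iterations.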
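
The two decisions that drive targeted refinement, which criterion to work on and whether to keep a revision, can be isolated as small pure functions. This is a sketch only: `pick_focus_area` and `accept_revision` are hypothetical helper names, and the 0.8 cutoff mirrors the constant used in the module in this section.

```python
from typing import Dict, Optional


def pick_focus_area(scores: Dict[str, float], good_enough: float = 0.8) -> Optional[str]:
    """Return the weakest criterion, or None when every score exceeds good_enough."""
    weakest = min(scores, key=lambda name: scores[name])
    return None if scores[weakest] > good_enough else weakest


def accept_revision(old_overall: float, new_overall: float, threshold: float) -> bool:
    """Keep a revision only if it beats the old overall score by more than threshold."""
    return new_overall > old_overall + threshold


focus = pick_focus_area({"accuracy": 0.9, "clarity": 0.6, "completeness": 0.75})
print(focus)  # prints: clarity
```

A larger threshold rejects marginal revisions and ends refinement sooner; a smaller one accepts more revisions but is more exposed to noise in LM-assigned scores.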

In [None]:
class MultiCriteriaEvaluationSignature(dspy.Signature):
    """Evaluate response on multiple criteria."""
    
    question: str = dspy.InputField(desc="The original question")
    response: str = dspy.InputField(desc="The response to evaluate")
    accuracy_score: float = dspy.OutputField(desc="Accuracy score (0.0-1.0)")
    clarity_score: float = dspy.OutputField(desc="Clarity score (0.0-1.0)")
    completeness_score: float = dspy.OutputField(desc="Completeness score (0.0-1.0)")
    overall_score: float = dspy.OutputField(desc="Overall score (0.0-1.0)")

class TargetedRefinementSignature(dspy.Signature):
    """Refine response focusing on specific aspects."""
    
    question: str = dspy.InputField(desc="The original question")
    current_response: str = dspy.InputField(desc="Current response")
    focus_area: str = dspy.InputField(desc="Specific aspect to improve (accuracy, clarity, completeness)")
    refined_response: str = dspy.OutputField(desc="Response refined for the focus area")

class AdvancedRefinementSystem(dspy.Module):
    """Advanced refinement with multi-criteria evaluation and targeted improvements."""
    
    def __init__(self, improvement_threshold: float = 0.1):
        super().__init__()
        self.improvement_threshold = improvement_threshold
        self.generator = dspy.ChainOfThought(ResponseGenerationSignature)
        self.multi_evaluator = dspy.ChainOfThought(MultiCriteriaEvaluationSignature)
        self.targeted_refiner = dspy.ChainOfThought(TargetedRefinementSignature)
    
    def forward(self, question: str) -> dspy.Prediction:
        # Generate initial response
        initial_response = self.generator(question=question)
        current_response = initial_response.response
        
        # Initial evaluation
        current_eval = self.multi_evaluator(
            question=question,
            response=current_response
        )
        
        refinement_log = [{
            'step': 0,
            'response': current_response,
            'accuracy': float(current_eval.accuracy_score),
            'clarity': float(current_eval.clarity_score),
            'completeness': float(current_eval.completeness_score),
            'overall': float(current_eval.overall_score),
            'focus_area': 'initial'
        }]
        
        # Identify areas needing improvement
        criteria_scores = {
            'accuracy': float(current_eval.accuracy_score),
            'clarity': float(current_eval.clarity_score),
            'completeness': float(current_eval.completeness_score)
        }
        
        max_iterations = 3
        iteration = 0
        
        while iteration < max_iterations:
            # Find the lowest scoring criterion
            focus_area = min(criteria_scores, key=criteria_scores.get)
            
            # If all scores are good enough, stop
            if criteria_scores[focus_area] > 0.8:
                break
            
            # Refine focusing on the weakest area
            refined_result = self.targeted_refiner(
                question=question,
                current_response=current_response,
                focus_area=focus_area
            )
            
            # Evaluate the refined response
            new_eval = self.multi_evaluator(
                question=question,
                response=refined_result.refined_response
            )
            
            new_overall = float(new_eval.overall_score)
            
            # Check if improvement is significant
            if new_overall > float(current_eval.overall_score) + self.improvement_threshold:
                current_response = refined_result.refined_response
                current_eval = new_eval
                criteria_scores = {
                    'accuracy': float(new_eval.accuracy_score),
                    'clarity': float(new_eval.clarity_score),
                    'completeness': float(new_eval.completeness_score)
                }
                
                refinement_log.append({
                    'step': iteration + 1,
                    'response': current_response,
                    'accuracy': criteria_scores['accuracy'],
                    'clarity': criteria_scores['clarity'],
                    'completeness': criteria_scores['completeness'],
                    'overall': new_overall,
                    'focus_area': focus_area
                })
            else:
                # No significant improvement, stop
                break
            
            iteration += 1
        
        return dspy.Prediction(
            final_response=current_response,
            final_scores=criteria_scores,
            overall_score=float(current_eval.overall_score),
            refinement_log=refinement_log,
            iterations_performed=len(refinement_log) - 1
        )

# Test Advanced Refinement
advanced_refiner = AdvancedRefinementSystem(improvement_threshold=0.05)

complex_question = "Analyze the environmental, economic, and social impacts of transitioning to renewable energy sources."
advanced_result = advanced_refiner(question=complex_question)

print(f"Question: {complex_question}")
print(f"\nFinal Response:")
print(advanced_result.final_response)
print(f"\nFinal Scores:")
for criterion, score in advanced_result.final_scores.items():
    print(f"  {criterion.capitalize()}: {score:.3f}")
print(f"  Overall: {advanced_result.overall_score:.3f}")
print(f"\nIterations performed: {advanced_result.iterations_performed}")

print("\nRefinement Progress:")
for step in advanced_result.refinement_log:
    print(f"Step {step['step']} (Focus: {step['focus_area']}): Overall = {step['overall']:.3f}")

## Performance Comparison

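The comparison below runs each approach on the same question and scores the outputs with a single LM-based evaluator. Treat the resulting ranking as indicative only: it rests on one question and one LM judge.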
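
A single question gives a noisy ranking, so one option is to average scores over several questions. The sketch below assumes each run is a dictionary shaped like the per-method results built in this section (method name mapped to a dictionary with a `score` key). `average_scores` is a hypothetical helper, and the numbers are placeholders for illustration, not measured results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def average_scores(runs: List[Dict[str, Dict[str, float]]]) -> Dict[str, float]:
    """Average each method's score across several comparison runs."""
    per_method: Dict[str, List[float]] = defaultdict(list)
    for run in runs:
        for method, result in run.items():
            per_method[method].append(result["score"])
    return {method: mean(scores) for method, scores in per_method.items()}


# Placeholder scores for illustration only; real values come from the evaluator.
placeholder_runs = [
    {"Baseline": {"score": 0.6}, "Best-of-3": {"score": 0.8}},
    {"Baseline": {"score": 0.7}, "Best-of-3": {"score": 0.7}},
]
averages = average_scores(placeholder_runs)
for method, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{method}: {avg:.2f}")
```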

In [None]:
def compare_refinement_methods(question: str):
    """Compare different refinement methods on the same question."""
    
    methods = {
        'Baseline': dspy.ChainOfThought(ResponseGenerationSignature),
        'Best-of-3': BestOfNGenerator(n_candidates=3),
        'Iterative Refine': IterativeRefiner(max_iterations=2),
        'Hybrid': HybridRefinementSystem(n_candidates=3, max_refinements=2),
        'Advanced': AdvancedRefinementSystem(improvement_threshold=0.05)
    }
    
    evaluator = dspy.ChainOfThought(ResponseEvaluationSignature)
    results = {}
    
    for method_name, method in methods.items():
        print(f"\nTesting {method_name}...")
        
        try:
            result = method(question=question)
            # Each method exposes its answer under a different field name.
            if method_name == 'Baseline':
                response = result.response
            elif method_name == 'Best-of-3':
                response = result.best_response
            else:  # 'Iterative Refine', 'Hybrid', and 'Advanced'
                response = result.final_response
            
            # Evaluate the response
            evaluation = evaluator(question=question, response=response)
            
            results[method_name] = {
                'response': response,
                'score': float(evaluation.score),
                'reasoning': evaluation.reasoning
            }
            
        except Exception as e:
            print(f"Error with {method_name}: {e}")
            results[method_name] = {
                'response': f"Error: {e}",
                'score': 0.0,
                'reasoning': "Method failed"
            }
    
    return results

# Run comparison
comparison_question = "What are the key challenges in implementing sustainable urban planning?"
comparison_results = compare_refinement_methods(comparison_question)

print(f"\n{'='*60}")
print(f"REFINEMENT METHOD COMPARISON")
print(f"{'='*60}")
print(f"Question: {comparison_question}")
print(f"{'='*60}")

# Sort by score
sorted_results = sorted(comparison_results.items(), key=lambda x: x[1]['score'], reverse=True)

for rank, (method, result) in enumerate(sorted_results, 1):
    print(f"\n#{rank}. {method} (Score: {result['score']:.3f})")
    print(f"Response: {result['response'][:200]}...")
    print(f"Reasoning: {result['reasoning']}")
    print("-" * 50)

## Conclusion

This tutorial demonstrated various output refinement techniques in DSPy:

### Key Takeaways:

1. **Best-of-N Sampling**: Generate multiple candidates and select the best one
2. **Iterative Refinement**: Improve responses through multiple feedback cycles
3. **Hybrid Approaches**: Combine techniques for maximum quality improvement
4. **Multi-Criteria Evaluation**: Assess responses on multiple dimensions
5. **Targeted Refinement**: Focus improvements on specific weak areas

### When to Use Each Technique:

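- **Best-of-N**: When output quality varies from sample to sample and you can afford about 2N LM calls per question (N generations plus N evaluations)
- **Iterative Refinement**: When a single draft is close but has specific weaknesses that feedback can target
- **Hybrid**: When quality matters more than cost, since it pays for both the candidate pool and the refinement passes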
- **Advanced**: When you need fine-grained control over specific quality aspects

### Best Practices:

1. Balance quality improvement with computational cost
2. Use appropriate evaluation metrics for your domain
3. Set reasonable improvement thresholds to avoid over-refinement
4. Consider caching refined responses for repeated queries
5. Monitor performance across different types of questions
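
For the caching practice above, one simple option is to memoize the whole refinement pipeline by question string. This is a minimal sketch using the standard library's `functools.lru_cache`; `expensive_refine` is a hypothetical stand-in for a refinement module call, and an in-memory cache like this does not persist across processes.

```python
from functools import lru_cache
from typing import List

calls: List[str] = []  # records every time the expensive path actually runs


def expensive_refine(question: str) -> str:
    """Hypothetical stand-in for a multi-call refinement pipeline."""
    calls.append(question)
    return f"refined answer to: {question}"


@lru_cache(maxsize=256)
def cached_refine(question: str) -> str:
    """Return the cached answer when the exact same question was seen before."""
    return expensive_refine(question)


first = cached_refine("What is DSPy?")
second = cached_refine("What is DSPy?")  # served from the cache
print(len(calls))  # prints: 1
```

The cache key here is the exact question string, so rephrased questions still trigger a fresh run.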

These techniques can improve the quality and consistency of your DSPy applications, especially for complex reasoning tasks and high-stakes applications, at the cost of additional LM calls per query.