# Unit 3

## Automatic Instruction Optimization with DSPy

Welcome to our lesson on **Automatic Instruction Optimization** with **DSPy**! In our previous lesson, we explored how Few-Shot Learning optimizers enhance your DSPy programs by automatically selecting and generating examples to include in your prompts. Today, we'll focus on a different approach: **optimizing the actual instructions** in your prompts.

-----

While Few-Shot Learning optimizers focus on providing examples to guide the language model, **Instruction Optimization** optimizers focus on improving the natural language instructions themselves. Instead of asking, "What examples should I show the model?" these optimizers ask, "**How should I phrase my request to the model?**"

This distinction is important because the way you phrase your instructions can **significantly impact the model's performance**, even with the same underlying task. A well-crafted instruction can guide the model to produce better outputs without needing additional examples, or it can work alongside examples to further enhance performance.

DSPy offers two powerful instruction optimizers:

  * **COPRO** (**C**ontrastive **P**rompt **O**ptimization): Generates and refines new instructions for each step in your program, optimizing them through **coordinate ascent**.
  * **MIPROv2** (**M**ultiprompt **I**nstruction **PR**oposal **O**ptimizer Version 2): Generates instructions that are aware of both your data and any demonstrations, using **Bayesian Optimization** to efficiently search the space of possible instructions.

These optimizers are valuable when you want to keep your prompts concise (reducing token usage) or when you're working with models that respond better to clear instructions.

-----

## Understanding COPRO (Contrastive Prompt Optimization)

**COPRO** is a powerful technique for automatically improving instructions. The core idea is to generate multiple alternative instructions, evaluate them using your metric, and iteratively refine them to find the best-performing set.

The "**contrastive**" aspect comes from how it learns by comparing instructions that lead to correct outputs with those that fail.

COPRO uses **coordinate ascent** (a form of hill-climbing) to optimize instructions:

1.  **Generate Alternatives:** For each module, COPRO generates multiple alternative instructions.
2.  **Evaluate:** It evaluates each alternative using your metric and training data.
3.  **Select & Refine:** It selects the best-performing instruction and repeats the process for multiple iterations, generating new alternatives based on the current best instructions.
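The three steps above can be sketched in plain Python. This is a toy illustration of coordinate ascent over per-module instructions, not COPRO's actual implementation: `propose` and `score` are hypothetical stand-ins for the prompt model and for running your metric over the training data.

```python
import random

HINTS = ["Think step by step.", "Answer concisely.", "Show your work."]

def propose(current, rng):
    # Hypothetical proposer: append a random hint to the current instruction.
    return f"{current} {rng.choice(HINTS)}"

def score(instructions):
    # Hypothetical metric: count how many distinct hints appear across modules.
    text = " ".join(instructions.values())
    return sum(hint in text for hint in HINTS)

def coordinate_ascent(instructions, propose, score, breadth=3, depth=2, seed=0):
    """Improve one module's instruction at a time while holding the others fixed."""
    rng = random.Random(seed)
    best = dict(instructions)
    best_score = score(best)
    for _ in range(depth):                    # refinement rounds
        for module in list(instructions):     # one coordinate (module) at a time
            candidates = [propose(best[module], rng) for _ in range(breadth)]
            for candidate in candidates:
                trial = {**best, module: candidate}
                trial_score = score(trial)
                if trial_score > best_score:  # keep only strict improvements
                    best, best_score = trial, trial_score
    return best, best_score

start = {"solver": "Solve the problem."}
best, best_score = coordinate_ascent(start, propose, score)
print(best_score, best["solver"])
```

Here `breadth` controls how many candidates are tried per module and `depth` controls how many rounds are run, mirroring the roles those names play in the parameter table that follows.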

### Key COPRO Parameters

| Parameter | Description | Default |
| :--- | :--- | :--- |
| `prompt_model` | The language model used to generate new instruction candidates. | N/A |
| `metric` | A function that evaluates the performance of your program. | N/A |
| `breadth` | The number of new instruction candidates to generate in each iteration. | 10 |
| `depth` | The number of iterations to run the optimization process. | 3 |
| `init_temperature` | The temperature used when generating new instruction candidates (higher = more diverse candidates). | 1.4 |
| `verbose` | Whether to print detailed information during optimization. | `False` |

COPRO is effective when you have a clear metric and want to **optimize instructions without relying on examples**.
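To make "a clear metric" concrete, here is a minimal sketch of a metric function with the `(example, pred, trace=None)` shape used elsewhere in this lesson. It assumes both the example and the prediction expose an `answer` attribute; the `SimpleNamespace` objects are stand-ins so the snippet runs without a language model.

```python
from types import SimpleNamespace

def exact_match_metric(example, pred, trace=None):
    """Return True when the predicted answer equals the gold answer,
    ignoring case and surrounding whitespace."""
    return example.answer.strip().lower() == pred.answer.strip().lower()

# Stand-in objects (assumption: real examples and predictions expose `.answer`)
example = SimpleNamespace(question="What is 2 + 3?", answer="5")
good = SimpleNamespace(answer=" 5 ")
bad = SimpleNamespace(answer="6")

print(exact_match_metric(example, good))  # True
print(exact_match_metric(example, bad))   # False
```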

-----

## Implementing COPRO in DSPy

Implementing COPRO involves configuring the optimizer with key parameters and using it to compile your program.

```python
from dspy.teleprompt import COPRO

# Define evaluation parameters for the compilation phase
eval_kwargs = dict(num_threads=16, display_progress=True, display_table=0)

# Create the COPRO optimizer
copro_teleprompter = COPRO(
    prompt_model=model_to_generate_prompts,  # E.g., dspy.LM('openai/gpt-4')
    metric=your_defined_metric,              # Your evaluation metric function
    breadth=num_new_prompts_generated,       # E.g., 16
    depth=times_to_generate_prompts,         # E.g., 2
    init_temperature=prompt_generation_temperature,  # E.g., 1.0
    verbose=False
)

# Compile your program with the optimizer
compiled_program_optimized_signature = copro_teleprompter.compile(
    your_dspy_program,
    trainset=trainset,
    eval_kwargs=eval_kwargs
)
```

The `compile()` process will iteratively refine the natural language instructions within your program's modules, selecting the signature that performs best on the provided `trainset` as measured by the `metric`.

-----

## Understanding MIPROv2 (Multiprompt Instruction Proposal Optimizer Version 2)

**MIPROv2** is a more comprehensive optimizer that can optimize both **instructions and few-shot examples**. It generates instructions that are **data-aware** and **demonstration-aware**, meaning they are tailored to work effectively with the specific examples being used.

MIPROv2 uses **Bayesian Optimization** to efficiently explore the search space, often finding better instructions with fewer evaluations than COPRO's coordinate ascent.
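To see why an efficient search matters, consider how quickly the joint space of instructions and demonstration sets grows. The sketch below is only an illustration of the search-space size and of searching under a fixed budget. It uses plain random sampling as a stand-in, whereas a Bayesian optimizer would use the scores of earlier trials to decide which combination to try next. All names and the scoring function are hypothetical.

```python
import itertools
import random

instructions = [f"instruction_{i}" for i in range(6)]  # hypothetical candidates
demo_sets = [f"demo_set_{j}" for j in range(5)]        # hypothetical candidates

# Every (instruction, demo set) pair is one point in the search space.
search_space = list(itertools.product(instructions, demo_sets))
print(len(search_space))  # 30

def toy_score(instruction, demo_set):
    # Hypothetical stand-in for evaluating the program on a batch of examples.
    return (int(instruction[-1]) * 3 + int(demo_set[-1]) * 2) % 7

def budgeted_search(space, budget, seed=0):
    """Score only `budget` sampled combinations and return the best one seen."""
    rng = random.Random(seed)
    trials = rng.sample(space, budget)
    return max(trials, key=lambda pair: toy_score(*pair))

best = budgeted_search(search_space, budget=10)
print(best)
```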

### Key MIPROv2 Parameters

| Parameter | Description |
| :--- | :--- |
| `metric` | A function that evaluates the performance of your program. |
| `auto` | Specifies the optimization intensity: `"light"`, `"medium"`, or `"heavy"`. Lighter settings are faster for experimentation, while heavier settings perform more trials for better results. |
| `max_bootstrapped_demos` | (Used in `compile`) Maximum number of new examples to self-generate. |
| `max_labeled_demos` | (Used in `compile`) Maximum number of examples to use directly from the training set. |

MIPROv2 is particularly effective when you have a **reasonable amount of training data** (e.g., 200+ examples) and want to **optimize both instructions and examples in a unified way**.

-----

## Implementing MIPROv2 in DSPy

MIPROv2 supports both few-shot and zero-shot configurations by adjusting the `compile()` parameters.

### Few-Shot Configuration (Optimizing Instructions + Examples)

```python
from dspy.teleprompt import MIPROv2

# Create the MIPROv2 optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light"  # Start with "light" for quick experimentation
)

# Compile the program with few-shot parameters
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,  # Generate up to 3 examples
    max_labeled_demos=4,       # Use up to 4 existing examples
    requires_permission_to_run=False,
)
```

### Zero-Shot Configuration (Optimizing Instructions Only)

To run MIPROv2 in a zero-shot mode, simply set both demonstration limits to zero during compilation:

```python
from dspy.teleprompt import MIPROv2

# Create the MIPROv2 optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light"
)

# Compile the program in zero-shot mode
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=0,  # No generated examples
    max_labeled_demos=0,       # No examples from training set
    requires_permission_to_run=False,
)
```

-----

## Summary and Practice Preview

| Optimizer | Primary Focus | Optimization Mechanism | Recommended Use Case |
| :--- | :--- | :--- | :--- |
| **COPRO** | Instructions Only | Coordinate Ascent (Hill-Climbing) | Focus purely on optimizing instructions, especially for zero-shot prompts. |
| **MIPROv2** | Instructions + Examples | Bayesian Optimization | Optimizing instructions and few-shot examples jointly, especially with moderate-to-large training sets (200+). |

**Guidelines for Selection:**

  * **Instructions Only:** Use **COPRO**.
  * **Instructions + Examples (Best Overall):** Use **MIPROv2**.
  * **Larger Data / Longer Run:** Use **MIPROv2** with `auto="medium"` or `"heavy"` for potentially better results.
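The guidelines above can be written as a small decision helper. This is a sketch only: the function name is hypothetical, and the 200-example threshold and the choice of `"medium"` for longer runs are assumptions drawn from the table, not rules enforced by DSPy.

```python
def choose_optimizer(optimize_examples, trainset_size, long_run=False):
    """Return (optimizer_name, auto_setting) following the guidelines above."""
    if not optimize_examples:
        return "COPRO", None        # instructions only
    if long_run and trainset_size >= 200:
        return "MIPROv2", "medium"  # or "heavy" for an even longer run
    return "MIPROv2", "light"       # quick first pass

print(choose_optimizer(False, 50))
print(choose_optimizer(True, 500, long_run=True))
```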

In the upcoming practice exercises, you'll gain hands-on experience by implementing both **COPRO** and **MIPROv2** to see how they affect your program's behavior.

In the next lesson, we'll explore **Automatic Finetuning**, the final optimization category in DSPy, which involves updating the weights of the underlying language model itself.

## Tuning COPRO Parameters for Better Instructions

Now that you understand how COPRO works to optimize instructions, let's experiment with its key parameters! In this exercise, you'll tune the COPRO optimizer to see how different settings affect the instruction generation process.

First, you will implement COPRO for a simple math solver. Then, you'll modify two important parameters that control COPRO's behavior:

  * The `breadth` parameter, which determines how many candidate prompts are generated in each iteration.
  * The `init_temperature` parameter, which controls how diverse or creative the generated prompts will be.

Your task is to implement and run the optimization with different combinations of these parameters and observe how they affect:

  * The variety of instructions generated.
  * The optimization progress shown in the output.
  * The final performance scores.

This hands-on experience will help you develop an intuition for configuring instruction optimizers effectively in your own projects. By the end, you'll have a better understanding of the trade-offs between exploration (trying many diverse candidates) and exploitation (focusing on refining promising instructions).
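One way to organize the experiment is to build the parameter combinations up front and loop over them. The sketch below only constructs the grid; the specific values are illustrative assumptions, and the commented line shows where each configuration would be passed to the optimizer.

```python
from itertools import product

breadths = [4, 8]          # illustrative values, not recommendations
temperatures = [0.7, 1.4]  # illustrative values, not recommendations

configs = [
    {"breadth": b, "init_temperature": t}
    for b, t in product(breadths, temperatures)
]

for cfg in configs:
    print(cfg)
    # In the exercise, each configuration would be used roughly like:
    # optimizer = COPRO(metric=metric, **cfg)
```

Running each configuration against the same training set and evaluator keeps the comparison fair, since only the two COPRO settings change between runs.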

```python
import dspy
import os
from dspy.teleprompt import COPRO
from dspy.evaluate import Evaluate
from data import get_trainset, get_testset, get_devset, metric

# Set up a simple language model
lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
dspy.configure(lm=lm)

# Define a simple math problem solver program
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solver = dspy.Predict("question -> answer")
    
    def forward(self, question):
        return self.solver(question=question)

# Get data
trainset = get_trainset()
testset = get_testset()
devset = get_devset()

# Create the base program
base_program = MathSolver()

# Define evaluation parameters
eval_kwargs = dict(num_threads=4, display_progress=True, display_table=0)


# TODO: Create the COPRO optimizer with specified parameters
copro_teleprompter = COPRO(
    
)

# TODO: Compile the program with the optimizer
optimized_program = None


# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

# Evaluate the optimized program
score = evaluator(optimized_program, metric=metric)

# TODO: Show results
print(f"Optimized instruction: {optimized_program.solver.signature}")


```


## Optimizing QA with COPRO Parameters

Now that you've learned about instruction optimization with DSPy, let's put COPRO into practice! In this exercise, you'll implement COPRO to automatically improve instructions for a question-answering system on the HotPotQA dataset.

After learning about the theory behind COPRO and how it uses coordinate ascent to find better instructions, it's time to see it in action. You'll configure the COPRO optimizer with specific parameters and observe how they affect the optimization process.

Your tasks are to:

1.  Load the data from HotPotQA.
2.  Complete the COPRO optimizer implementation with appropriate parameters.
3.  Compile your QA pipeline with the optimizer.
4.  Evaluate and compare the performance before and after optimization.

Pay special attention to the `breadth` parameter (which controls how many candidate instructions to generate) and the `init_temperature` parameter (which controls instruction diversity). These settings determine how thoroughly COPRO explores possible instructions.

By completing this exercise, you'll gain practical experience with instruction optimization and see firsthand how better instructions can improve model performance without changing your program's structure or adding examples.
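When comparing scores before and after optimization, a signed format specifier keeps the report readable whether the score went up or down. A minimal sketch, with a hypothetical helper name and made-up scores:

```python
def describe_change(baseline, optimized):
    """Format two scores and their signed difference."""
    delta = optimized - baseline
    return f"{baseline:.4f} -> {optimized:.4f} ({delta:+.4f})"

print(describe_change(0.5, 0.625))  # 0.5000 -> 0.6250 (+0.1250)
print(describe_change(0.5, 0.25))   # 0.5000 -> 0.2500 (-0.2500)
```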

```python
import dspy
import os
from dspy.teleprompt import COPRO
from dspy.evaluate import Evaluate
from dspy.datasets import HotPotQA

# --- Configuration ---
# Assuming dspy.OAI fixed the previous error
try:
    # Use dspy.OAI for OpenAI models (common convention in newer dspy versions)
    # Note: Using gpt-4o-mini here, but a better model might be needed for high COPRO performance
    lm = dspy.OAI(model='gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
    dspy.configure(lm=lm)
except Exception as e:
    print(f"LM Configuration Error: {e}")

# --- Data Loading ---
# Load a small HotPotQA sample: 20 training examples and 50 dev examples
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=1, dev_size=50)

# Define splits
train_set = dataset.train
evaluation_set = dataset.dev 
dev_set_for_copro_search = dataset.dev[:20] 

# Convert each split to dspy.Example objects with `question` marked as the input
def prepare_data(data_list):
    """Converts a list of dicts/examples into dspy.Example objects with specified inputs."""
    # Ensure all examples have the 'question' field marked as the input
    return [dspy.Example(**d).with_inputs("question") for d in data_list]

train_set_prepared = prepare_data(train_set)
evaluation_set_prepared = prepare_data(evaluation_set)
dev_set_for_copro_search_prepared = prepare_data(dev_set_for_copro_search)

print(f"Loaded {len(train_set_prepared)} examples for training (COPRO's search set).")
print(f"Loaded {len(evaluation_set_prepared)} examples for final evaluation.")

# Define the signature for our QA task
class CoTSignature(dspy.Signature):
    """Answer the question and give the reasoning for the same."""

    question = dspy.InputField(desc="question about something")
    reasoning = dspy.OutputField(desc="reasoning for the answer")
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Create our Chain of Thought pipeline
class CoTPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.signature = CoTSignature
        self.predictor = dspy.ChainOfThought(self.signature)

    def forward(self, question):
        result = self.predictor(question=question)
        return dspy.Prediction(
            answer=result.answer,
            reasoning=result.reasoning,
        )

# Define our evaluation metric
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    return answer_EM

# --- Baseline Evaluation ---
cot_baseline = CoTPipeline()

# Use the prepared evaluation set
evaluate = Evaluate(devset=evaluation_set_prepared, metric=validate_context_and_answer, display_progress=True, display_table=True)

print("\n" + "="*50)
print("Evaluating baseline model...")
print("="*50)
baseline_score = evaluate(cot_baseline, num_threads=1) 
print(f"Baseline Score: {baseline_score:.4f}")

# --- COPRO Optimization ---
# COPRO only takes instruction-search settings; it has no demo-related options
teleprompter = COPRO(
    metric=validate_context_and_answer,
    breadth=10,            # candidate instructions generated per iteration
    init_temperature=1.0,  # diversity of the generated candidates
)

# Settings forwarded to the evaluator that COPRO runs internally
kwargs = dict(num_threads=1, display_progress=True, display_table=0)

print("\n" + "="*50)
print(f"Starting COPRO Optimization on {len(train_set_prepared)} examples...")
print("="*50)
# COPRO scores candidates on the trainset; evaluator settings go in eval_kwargs
compiled_prompt_opt = teleprompter.compile(
    cot_baseline,
    trainset=train_set_prepared,
    eval_kwargs=kwargs,
)

# --- Optimized Evaluation ---
print("\n" + "="*50)
print("Evaluating optimized model...")
print("="*50)
optimized_score = evaluate(compiled_prompt_opt, num_threads=1)
print(f"Optimized Score: {optimized_score:.4f}")

# --- Display Results and History ---
print("\n" + "="*50)
print("Optimization Results Comparison")
print("="*50)
print(f"Baseline Score (Answer EM):     {baseline_score:.4f}")
print(f"Optimized Score (Answer EM):    {optimized_score:.4f}")
improvement = optimized_score - baseline_score
print(f"Improvement: {'+' if improvement >= 0 else ''}{improvement:.4f}")

# Display the final optimized instruction
print("\n" + "="*50)
print("Final Optimized Instructions in the Pipeline")
print("="*50)
# dspy.ChainOfThought wraps an inner Predict module (`predict`) that holds the signature
optimized_description = compiled_prompt_opt.predictor.predict.signature.instructions
print(f"Description: {optimized_description}")

```

The errors `AttributeError: module 'dspy' has no attribute 'OAI'` and `AttributeError: module 'dspy' has no attribute 'DummyLM'` mean that the code refers to class names the installed **DSPy** package does not expose at the top level, so the Language Model (LM) is never configured.

Recent versions of DSPy **unified all language model providers** (like OpenAI, Anthropic, etc.) under the single, generic class: `dspy.LM`.

### ✅ The Fix

You need to replace the deprecated class names (`dspy.OAI` and `dspy.DummyLM`) with the current, unified class, **`dspy.LM`**.

#### **Corrected Code Block**

The code should be updated as follows:

```python
import dspy
import os
from dspy.teleprompt import COPRO
from dspy.evaluate import Evaluate
from dspy.datasets import HotPotQA

# --- Configuration ---
try:
    # FIX: Use dspy.LM for all model providers (OpenAI, Anthropic, etc.)
    lm = dspy.LM(
        model='openai/gpt-4o-mini', # Use provider/model format
        api_key=os.environ['OPENAI_API_KEY'], 
        api_base=os.environ['OPENAI_BASE_URL']
    )
    dspy.configure(lm=lm)
except Exception as e:
    # There is no top-level dspy.DummyLM to fall back on, so report the
    # problem and stop instead of continuing without a configured LM.
    print(f"LM Configuration Error: {e}. Cannot configure an active LM.")
    raise

# ... (rest of the code remains the same)
```

Two changes bring the code in line with the current DSPy API:

1.  Replace `dspy.OAI(...)` with **`dspy.LM('openai/gpt-4o-mini', ...)`**.
2.  Remove the `dspy.DummyLM()` fallback, or replace it with a properly initialized **`dspy.LM`** if a fallback is required.

### 🔑 Key Takeaway

In modern DSPy:

  * Use **`dspy.LM(...)`** for all Language Model instantiation.
  * The model name should follow the **`provider/model-name`** format (e.g., `openai/gpt-4o-mini`, `anthropic/claude-3-opus`, etc.) as it integrates with the LiteLLM library.
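As a quick sanity check on the `provider/model-name` format, a string can be validated before it is handed to `dspy.LM`. The helper below is purely illustrative and is not part of DSPy or LiteLLM; it only checks that both halves of the identifier are present.

```python
def split_model_id(model_id):
    """Split 'provider/model-name' into its two parts, or raise ValueError."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, name

print(split_model_id("openai/gpt-4o-mini"))  # ('openai', 'gpt-4o-mini')
```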

## Comparing Few-Shot and Zero-Shot Optimization

Now that you've explored COPRO and its parameters, let's dive into MIPROv2 and its unique capabilities! In this exercise, you'll compare few-shot and zero-shot configurations of MIPROv2 on a sentiment analysis task.

After learning about how MIPROv2 can work with or without examples, you'll implement both approaches and analyze the differences. You'll configure MIPROv2 twice:

  * Once with positive values for `max_bootstrapped_demos` and `max_labeled_demos` (few-shot)
  * Once with both parameters set to 0 (zero-shot)

By comparing the performance and examining the optimized instructions from both configurations, you'll gain practical insights into when to use examples versus when to rely solely on instructions. This hands-on experience will help you make informed decisions about optimization strategies in your own projects.
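Because the two runs differ only in their demonstration limits, it can help to keep those limits in named dictionaries and reuse the rest of the call. This is a sketch of that organization; the dictionary names and the helper are hypothetical, and the commented line shows where the settings would be unpacked.

```python
RUNS = {
    "few_shot": {"max_bootstrapped_demos": 3, "max_labeled_demos": 4},
    "zero_shot": {"max_bootstrapped_demos": 0, "max_labeled_demos": 0},
}

def is_zero_shot(compile_kwargs):
    """True when no demonstrations of either kind are allowed."""
    return (
        compile_kwargs["max_bootstrapped_demos"] == 0
        and compile_kwargs["max_labeled_demos"] == 0
    )

for name, compile_kwargs in RUNS.items():
    print(name, is_zero_shot(compile_kwargs))
    # In the exercise, each run would look roughly like:
    # teleprompter.compile(base_program.deepcopy(), trainset=trainset, **compile_kwargs)
```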

```python
import dspy
import os
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2
from data import get_trainset, get_testset, get_devset, sentiment_metric


# Set up the language model
lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
dspy.configure(lm=lm)


# Define a simple sentiment analyzer program
class SentimentAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classifier = dspy.Predict("text -> sentiment")
    
    def forward(self, text):
        return self.classifier(text=text)


# Get data
trainset = get_trainset()
testset = get_testset()
devset = get_devset()

# Create the base program
base_program = SentimentAnalyzer()

# Create the MIPROv2 optimizer
teleprompter = MIPROv2(
    metric=sentiment_metric,
    auto="light" # Light optimization run
)

# TODO: Run few-shot optimization
print("\n--- Running MIPROv2 with Few-Shot Configuration ---\n")
# TODO: Implement few-shot configuration with max_bootstrapped_demos and max_labeled_demos set to positive values

# TODO: Run zero-shot optimization
print("\n--- Running MIPROv2 with Zero-Shot Configuration ---\n")
# TODO: Implement zero-shot configuration with max_bootstrapped_demos and max_labeled_demos set to 0

# TODO: Evaluate both configurations
# TODO: Calculate scores for few-shot, zero-shot, and base programs

# TODO: Compare results
print("\n--- Comparison of Different MIPROv2 Configurations ---")
# TODO: Print scores for base program, few-shot, and zero-shot configurations

# TODO: Print optimized instructions
print("\n--- Few-Shot Optimized Instruction ---")
# TODO: Print the few-shot program's optimized instruction

print("\n--- Zero-Shot Optimized Instruction ---")
# TODO: Print the zero-shot program's optimized instruction

# TODO: Analyze differences
print("\n--- Analysis of Differences ---")
# TODO: Compare the performance and instruction differences between the two configurations
```

This exercise demonstrates the flexibility of **MIPROv2**, showing how it can optimize both instructions and examples (**few-shot**) or instructions alone (**zero-shot**).

Here is the completed code that implements both optimization scenarios, evaluates them, and compares the results.

### ✅ Completed MIPROv2 Comparison Code

```python
import dspy
import os
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2
# NOTE: Assuming the 'data' module provides the necessary functions and metric
from data import get_trainset, get_testset, get_devset, sentiment_metric 


# --- Configuration ---
try:
    # Set up the language model using dspy.LM (correct class for all providers)
    # NOTE: You must have OPENAI_API_KEY and OPENAI_BASE_URL set in your environment
    lm = dspy.LM(
        model='openai/gpt-4o-mini', 
        api_key=os.environ['OPENAI_API_KEY'], 
        api_base=os.environ['OPENAI_BASE_URL']
    )
    dspy.configure(lm=lm)
except Exception as e:
    # dspy has no top-level DummyLM, so stop here rather than fall back to one
    print(f"LM Configuration Error: {e}. Cannot configure an active LM.")
    raise


# Define a simple sentiment analyzer program
class SentimentAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        # The Predict module is what MIPROv2 will optimize (its instructions and/or examples)
        self.classifier = dspy.Predict("text -> sentiment")
    
    def forward(self, text):
        return self.classifier(text=text)


# Get data
trainset = get_trainset()
testset = get_testset()
devset = get_devset()

# Create the base program
base_program = SentimentAnalyzer()

# Create the MIPROv2 optimizer
teleprompter = MIPROv2(
    metric=sentiment_metric,
    auto="light" # Light optimization run for quick testing
)

# Define evaluation utility
evaluate = Evaluate(devset=testset, metric=sentiment_metric, display_progress=True, display_table=True)


# --- Baseline Evaluation ---
print("\n--- Evaluating Base Program (Zero-Shot, Unoptimized) ---")
base_score = evaluate(base_program, num_threads=1, display_progress=False)
print(f"Base Program Score: {base_score:.4f}")

# --- Few-Shot Optimization ---
print("\n--- Running MIPROv2 with Few-Shot Configuration ---\n")
# Few-shot configuration: Optimize instructions AND generate/use examples.
compiled_few_shot = teleprompter.compile(
    base_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,  # Generate up to 3 bootstrapped examples
    max_labeled_demos=2,       # Use up to 2 examples directly from the trainset
    requires_permission_to_run=False,
)

# --- Few-Shot Evaluation ---
print("\n--- Evaluating Few-Shot Optimized Program ---")
few_shot_score = evaluate(compiled_few_shot, num_threads=1, display_progress=False)
print(f"Few-Shot Optimized Score: {few_shot_score:.4f}")

# --- Zero-Shot Optimization ---
print("\n--- Running MIPROv2 with Zero-Shot Configuration ---\n")
# Zero-shot configuration: Optimize instructions ONLY by setting demo limits to 0.
teleprompter_zero = MIPROv2(metric=sentiment_metric, auto="light")
compiled_zero_shot = teleprompter_zero.compile(
    base_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=0,  # No generated examples
    max_labeled_demos=0,       # No examples from training set
    requires_permission_to_run=False,
)

# --- Zero-Shot Evaluation ---
print("\n--- Evaluating Zero-Shot Optimized Program ---")
zero_shot_score = evaluate(compiled_zero_shot, num_threads=1, display_progress=False)
print(f"Zero-Shot Optimized Score: {zero_shot_score:.4f}")


# --- Comparison of Different MIPROv2 Configurations ---
print("\n" + "="*50)
print("Comparison of Different MIPROv2 Configurations")
print("="*50)
print(f"Baseline Score (Metric):        {base_score:.4f}")
print(f"Few-Shot Optimized Score:       {few_shot_score:.4f}")
print(f"Zero-Shot Optimized Score:      {zero_shot_score:.4f}")


# --- Optimized Instructions ---
print("\n" + "="*50)
print("Optimized Instructions")
print("="*50)

print("\n--- Few-Shot Optimized Instruction (MIPROv2 with Examples) ---")
# A dspy.Predict module keeps its (optimized) instructions on its signature
few_shot_instructions = compiled_few_shot.classifier.signature.instructions
print(f"Description: {few_shot_instructions}")

print("\n--- Zero-Shot Optimized Instruction (MIPROv2 Instruction-Only) ---")
zero_shot_instructions = compiled_zero_shot.classifier.signature.instructions
print(f"Description: {zero_shot_instructions}")


# --- Analysis of Differences ---
print("\n" + "="*50)
print("Analysis of Differences")
print("="*50)
print(f"Performance Improvement: Few-Shot ({few_shot_score - base_score:+.4f}) vs. Zero-Shot ({zero_shot_score - base_score:+.4f})")

if few_shot_score > zero_shot_score:
    print("\n**Few-Shot Analysis:** The Few-Shot configuration typically performs **better** (or equal) because it leverages both an **optimized instruction** AND **contextual examples** (demonstrations). The MIPROv2 optimizer selects a set of examples and then finds an instruction tailored to maximize performance *with those specific examples*.")
else:
    print("\n**Zero-Shot Analysis:** In some cases, the Zero-Shot configuration may perform better. This happens if the task is highly sensitive to the *quality* of the instructions, and the added examples (demos) selected by MIPROv2 were noisy or misleading. The Zero-Shot mode forces the optimizer to find the **most robust, self-contained instruction** possible.")
```

## Balancing Optimization Intensity for Better Results

Now that you've explored both COPRO and MIPROv2, let's dive deeper into MIPROv2's optimization intensity settings! In this exercise, you'll compare how different optimization intensities affect performance and runtime.

You'll work with a sentiment analysis program that's already set up with a "light" optimization. Your tasks are to:

1.  Add code to run MIPROv2 with the `"heavy"` setting.
2.  Track and compare the runtime of both optimization processes.
3.  Analyze the differences in performance and generated instructions.
4.  Determine if the extra computation time is worth the potential gains.

This exercise will help you make informed decisions about when to use more intensive optimization in your own projects. By comparing the number of trials, runtime, and performance improvements, you'll develop a practical understanding of the trade-offs involved in instruction optimization.
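Timing both runs the same way makes the comparison easier to trust. The sketch below wraps any callable in a small timing helper; `fake_compile` is a hypothetical stand-in for the optimizer's `compile(...)` call so the snippet runs without a language model.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_compile(n_trials):
    # Hypothetical stand-in for teleprompter.compile(...): more trials, more work.
    return sum(i * i for i in range(n_trials))

light_result, light_seconds = timed(fake_compile, 1_000)
heavy_result, heavy_seconds = timed(fake_compile, 100_000)
print(f"light: {light_seconds:.4f}s, heavy: {heavy_seconds:.4f}s")
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring elapsed intervals.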

```python
import dspy
import os
import time
from dspy.teleprompt import MIPROv2
from dspy.evaluate import Evaluate
from data import get_trainset, get_testset, get_devset, sentiment_metric


# Set up the language model
lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
dspy.configure(lm=lm)


# Define a simple sentiment analyzer program
class SentimentAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classifier = dspy.Predict("text -> sentiment")
    
    def forward(self, text):
        return self.classifier(text=text)

# Get data
trainset = get_trainset()
testset = get_testset()
devset = get_devset()

# Create the base program
base_program = SentimentAnalyzer()

# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=devset, display_progress=True, display_table=5)

# Evaluate the base program
base_score = evaluator(base_program, metric=sentiment_metric)
print(f"Base program score: {base_score:.4f}")

# Run light optimization
print("\n--- Running MIPROv2 with 'light' optimization ---\n")
light_teleprompter = MIPROv2(
    metric=sentiment_metric,
    auto="light",  # Light optimization run (fewer trials)
)

# Time the light optimization
light_start_time = time.time()
light_program = light_teleprompter.compile(
    base_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
    requires_permission_to_run=False
)
light_end_time = time.time()
light_duration = light_end_time - light_start_time

# TODO: Create a new MIPROv2 optimizer with "heavy" optimization
# Add your code here to create a heavy teleprompter

# TODO: Time the heavy optimization
# Add your code here to measure the runtime of heavy optimization

# TODO: Evaluate both optimized programs
light_score = evaluator(light_program, metric=sentiment_metric)
# Add your code here to evaluate the heavy program

# Compare results
print("\n--- Comparison of Different MIPROv2 Intensities ---")
print(f"Base program score: {base_score:.4f}")
print(f"Light optimization score: {light_score:.4f}")
# TODO: Add code to print the heavy optimization score

# TODO: Compare runtime
print("\n--- Runtime Comparison ---")
print(f"Light optimization runtime: {light_duration:.2f} seconds")
# TODO: Add code to print the heavy optimization runtime and comparison

# Print optimized instructions
print("\n--- Light Optimization Instruction ---")
print(light_program.classifier.signature)

# TODO: Print the heavy optimization instruction
# Add your code here to print the heavy program's instruction

# TODO: Analyze differences
print("\n--- Analysis of Differences ---")
# TODO: Add code to compare performance, runtime, and instructions
# Add your observations about the differences between light and heavy optimization

```

This exercise requires implementing the **"heavy"** MIPROv2 optimization to compare its trade-offs against the **"light"** setting. The core difference will be observed in runtime, cost (due to more trials), and potentially the final performance score.

Here is the completed code with the required additions.

### ✅ Completed Optimization Intensity Comparison

```python
import dspy
import os
import time
from dspy.teleprompt import MIPROv2
from dspy.evaluate import Evaluate
# NOTE: Assuming the 'data' module provides the necessary functions and metric
from data import get_trainset, get_testset, get_devset, sentiment_metric 

# --- Configuration ---
try:
    # Set up the language model using dspy.LM (correct class for all providers)
    # NOTE: You must have OPENAI_API_KEY and OPENAI_BASE_URL set in your environment
    lm = dspy.LM(
        model='openai/gpt-4o-mini', 
        api_key=os.environ['OPENAI_API_KEY'], 
        api_base=os.environ['OPENAI_BASE_URL']
    )
    dspy.configure(lm=lm)
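except KeyError as e:
    # A missing environment variable means the LM cannot be configured, and the
    # optimization runs below need a real model, so stop with a clear message.
    raise SystemExit(f"LM configuration error: missing environment variable {e}.")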


# Define a simple sentiment analyzer program
class SentimentAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classifier = dspy.Predict("text -> sentiment")
    
    def forward(self, text):
        result = self.classifier(text=text)
        return dspy.Prediction(sentiment=result.sentiment) # Ensure output matches the signature if needed

# Get data
trainset = get_trainset()
testset = get_testset()
devset = get_devset()

# Create the base program
base_program = SentimentAnalyzer()

# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=devset, metric=sentiment_metric, display_progress=False, display_table=False)

# Evaluate the base program
base_score = evaluator(base_program, metric=sentiment_metric)
print(f"Base program score: {base_score:.4f}")

# --- Light Optimization ---
print("\n--- Running MIPROv2 with 'light' optimization ---\n")
light_teleprompter = MIPROv2(
    metric=sentiment_metric,
    auto="light",  # Light optimization run (fewer trials)
)

# Time the light optimization
light_start_time = time.time()
light_program = light_teleprompter.compile(
    base_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
    requires_permission_to_run=False
)
light_end_time = time.time()
light_duration = light_end_time - light_start_time

# --- Heavy Optimization ---
# Create a second MIPROv2 optimizer with "heavy" optimization
print("\n--- Running MIPROv2 with 'heavy' optimization ---\n")
heavy_teleprompter = MIPROv2(
    metric=sentiment_metric,
    auto="heavy",  # Heavy optimization run (more trials/candidates)
)

# Time the heavy optimization
heavy_start_time = time.time()
heavy_program = heavy_teleprompter.compile(
    base_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
    requires_permission_to_run=False
)
heavy_end_time = time.time()
heavy_duration = heavy_end_time - heavy_start_time

# --- Final Evaluation ---
light_score = evaluator(light_program, metric=sentiment_metric)
# Evaluate the heavy program on the same dev set
heavy_score = evaluator(heavy_program, metric=sentiment_metric)

# --- Comparison of Results ---
print("\n" + "="*50)
print("Comparison of Different MIPROv2 Intensities")
print("="*50)

# Compare scores
print(f"Base program score:       {base_score:.4f}")
print(f"Light optimization score: {light_score:.4f}")
print(f"Heavy optimization score: {heavy_score:.4f}") # Print the heavy optimization score

# Compare runtime
print("\n--- Runtime Comparison ---")
print(f"Light optimization runtime: {light_duration:.2f} seconds")
print(f"Heavy optimization runtime: {heavy_duration:.2f} seconds") # Print the heavy optimization runtime
time_ratio = heavy_duration / light_duration if light_duration > 0 else float('inf')
print(f"Heavy optimization ran approximately {time_ratio:.1f}x longer than Light.")

# Print optimized instructions
print("\n--- Optimized Instructions ---")
print("\n--- Light Optimization Instruction ---")
# Read the instruction from the predictor's signature, as in the exercise template
print(f"Instruction: {light_program.classifier.signature.instructions}")

# Print the heavy optimization instruction
print("\n--- Heavy Optimization Instruction ---")
print(f"Instruction: {heavy_program.classifier.signature.instructions}")

# --- Analysis of Differences ---
print("\n" + "="*50)
print("Analysis of Differences")
print("="*50)
score_diff = heavy_score - light_score
print(f"Performance Difference (Heavy - Light): {score_diff:+.4f}")

# Compare performance, runtime, and instructions
print("\n**Performance:**")
if score_diff > 0.0:
    print(f"The Heavy optimization was better, improving the score by {score_diff:.4f} over the Light setting.")
elif score_diff < 0.0:
    print(f"The Light optimization performed better: the Heavy setting scored {-score_diff:.4f} lower. One possible cause is overfitting to the small training set during the longer Heavy run, though run-to-run variance can also explain small gaps.")
else:
    print("Light and Heavy optimization achieved the same final score.")

print("\n**Runtime & Cost:**")
print(f"The Heavy setting took {time_ratio:.1f}x longer than the Light setting. Since MIPROv2 is inference-heavy (especially the Instruction Proposal and Bayesian Optimization stages), the Heavy setting involves significantly more API calls and computational time.")

print("\n**Instruction/Demo Differences:**")
print("The primary difference between the resulting programs is the **instruction phrasing** and the **set of few-shot examples** (demos). The Heavy setting explores a much larger combination space of these elements (more trials in the Bayesian Optimization) to find a marginal or substantial performance gain. The two optimized instructions are likely phrased differently, reflecting the optimizer's best effort to guide the model for its specific, selected few-shot set.")

```
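
The completed solution repeats the same start/stop timing pattern for the light and heavy runs. As an optional refactor, the sketch below wraps that pattern in a small helper. The names `timed`, `runtime_ratio`, and `fake_compile` are hypothetical and not part of DSPy; `fake_compile` only stands in for a call such as `heavy_teleprompter.compile(...)` so the example runs without an API key.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock, better suited to measuring durations
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def runtime_ratio(slow: float, fast: float) -> float:
    """Return how many times longer `slow` took than `fast` (inf if `fast` is zero)."""
    return slow / fast if fast > 0 else float("inf")


# Hypothetical stand-in for an optimizer's compile step
def fake_compile(program: str, trainset: list) -> str:
    time.sleep(0.01)  # simulate some work
    return f"optimized-{program}"


result, duration = timed(fake_compile, "base", trainset=[1, 2, 3])
print(f"{result} compiled in {duration:.2f} seconds")
```

In the exercise, the same helper could wrap each `compile` call, for example `heavy_program, heavy_duration = timed(heavy_teleprompter.compile, base_program.deepcopy(), trainset=trainset, ...)`, which keeps the light and heavy measurements consistent.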