## Article: How I unlocked tremendous value by Automating Prompt Engineering at Enterprise Scale

Section 1: The Problem: "Manual Prompt Engineering is non-deterministic and doesn't scale. I needed a system."
additional context: I wanted to unlock value from interaction data with the company such as calls & chats. I had tried finetuning, manual prompt engineering for years.

Section 2: The "internal champion → team adoption"
What convinced you first (early results? a specific win?)
How you got buy-in from others (demo? pilot project?)

Section 2: The Architecture
Diagram your flow: DSPy -> Genetic Mutation -> Evaluation Harness -> Pareto Selection.
Discuss the constraints: How we did a POC, scaled across teams for v1 of prompts

Section 3: The Evaluation Engine
Built a system that generates targeted feedback for failing cases

Section 4: The Impact -> manual prompt engineering to automated savings

Section 5: What we learned and how we would do it going forward.
clear target and dogfooded dataset (v1) -> run GEPA -> targeted feedback for failing cases -> run GEPA -> stop when we are fine with metrics -> deploy
post deployment -> look for failing cases -> add to dataset.

## Article: How I Unlocked Tremendous Value by Automating Prompt Engineering at Enterprise Scale

**Section 1: The Problem**
* The POC with prompts looked interesting -> now we need it to work at scale.
* Manual prompt engineering is non-deterministic and doesn't scale. I needed a system to unlock value from unstructured customer interaction data. Years of fine-tuning and manual prompt iteration weren't cutting it.
* I was desperate for a solution: LLMs give good feedback, so why can't that feedback be converted into good prompts?
* I was reading a lot before I stumbled upon DSPy and prompt optimizers. It clicked, and when the new optimizer GEPA came out, I wanted to test it.

**Section 2: From Pilot to Adoption**
* I created a dummy dataset resembling a generic interaction-analysis problem to see if the approach was moving in the right direction.
* I expected weeks of iteration but got results in one run.
* What convinced me: early results on a pilot use case
* The quality of the intermediate prompts impressed me while the optimizer was still in its loop.
* Ran for 10 hours at a cost of about $2: 200+ prompt variants → 9 survivors
* How I got buy-in: demo showing measurable improvement over baseline

**Section 3: The Architecture**
High-level flow diagram of how GEPA works: DSPy → Genetic Mutation → Evaluation Harness → Pareto Selection
Refer to https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html for a deep dive.
- I set up this experiment:
Task: Analyzing sales call transcripts to classify agent behaviors into categories (like "introduction_rapport_building", "objection_handling") and predict call outcome (good/bad)
Dataset: 27 labeled call transcripts split into train/val/test sets
Base model: A CallAnalysisAgent using dspy.ChainOfThought that reads transcripts and outputs categories + final result
Optimization: GEPA iteratively rewrites the prompt instructions, testing variants on the validation set. It maintains a "Pareto front" of candidate prompts that excel on different examples
Results: 72% -> 81% accuracy
* Started with a focused POC on a single use case; others adopted it after seeing how much faster and better it was.

**Section 4: Using error analysis on top of GEPA**
* Results were promising (81% accuracy). I used it internally, and it worked wherever it was implemented properly.
* But I faced the next challenge: how to take it further. The cases it failed on were not that difficult; my hunch was that the LLM could get them right if given a hint.
* Problem: how to guide the automated optimization toward prompts that handle the cases it was failing.

| Input (truncated) | Actual | Pred | Targeted feedback |
|---|---|---|---|
| Tyler calling from Amex... "sounds good, I'll send that link..." | bad | good | The customer showed hesitation ("I'm really not sure...") and the agent rushed to close without addressing concerns. A "bad" call lacks proper objection handling before closing. |
| Mark from Amex calling about credit solutions... | bad (intro_rapport=false) | bad (intro_rapport=true) | The agent said "Hi, this is Mark from Amex" without warmth, time acknowledgment, or rapport-building. A bare introduction doesn't qualify as introduction_rapport_building. |
| John from American Express... | bad (intro_rapport=false) | bad (intro_rapport=true) | Similar: the agent jumped straight into the pitch. Greeting alone isn't rapport-building; look for courtesy, time check, or warmth signals. |

The pattern: the model over-detects `introduction_rapport_building` (any greeting = rapport) and sometimes misses when a call is actually "bad" despite having next steps.
* Built a system using Excel that generates targeted feedback for failing cases
* Results improved from 81% to 90%
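
One way the targeted feedback could be wired back into the optimizer is to append each reviewer note to the automatic feedback for the same example. The sketch below is only an illustration under stated assumptions: it assumes the spreadsheet is exported as CSV with hypothetical `example_id` and `targeted_feedback` columns, and `load_targeted_feedback` and `merge_feedback` are made-up helper names, not part of DSPy or GEPA. The row shown is placeholder data.

```python
import csv
import io

def load_targeted_feedback(csv_text):
    """Parse reviewer notes exported from a spreadsheet into {example_id: note}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["example_id"]: row["targeted_feedback"].strip() for row in reader}

def merge_feedback(auto_feedback, example_id, notes):
    """Append the reviewer's note, if one exists, to the automatic feedback text."""
    note = notes.get(example_id)
    if not note:
        return auto_feedback
    return f"{auto_feedback}\nReviewer note: {note}"

# Placeholder export with a single reviewed failure (hypothetical id and wording).
sheet = (
    "example_id,targeted_feedback\n"
    'call_07,"A bare greeting is not rapport-building; look for courtesy or a time check."\n'
)
notes = load_targeted_feedback(sheet)

merged = merge_feedback("False positives: introduction_rapport_building.", "call_07", notes)
unchanged = merge_feedback("Perfect.", "call_01", notes)
```

Examples without a reviewer note keep their automatic feedback untouched, so the notes can be added incrementally as error analysis proceeds.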

**Section 5: The Impact**
* 72% -> 81% -> 90%: each round delivered a gain and the process was repeatable. All I need to do is error analysis and proper feedback.
* Shifted from manual prompt iteration (days/weeks) to automated optimization (hours). Significant reduction in prompt engineering effort per use case.

**Section 6: Lessons Learned & Recommended Workflow**
When not to use GEPA: when the target (y) variable is unclear, i.e. when you cannot do the task clearly yourself as a human.
1. Start with a clear target metric and a dogfooded dataset (v1)
2. Run GEPA → review failing cases → add targeted feedback
3. Run GEPA again → stop when metrics are acceptable → deploy
4. Post-deployment: monitor for failures → add to dataset → iterate
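
The workflow above can be read as a loop that alternates optimizer runs with error analysis. The following is a minimal sketch, not the actual pipeline: `optimize_until_good` and its three callables are hypothetical placeholders for project-specific code, and the scores in the usage lines are toy values chosen only to exercise the loop.

```python
def optimize_until_good(run_optimizer, evaluate, collect_feedback, target, max_rounds=5):
    """Alternate optimizer runs with error analysis until the metric reaches `target`.

    All three callables are placeholders for project-specific code:
      run_optimizer(feedback) -> candidate prompt or program
      evaluate(candidate)     -> (score, failing_cases)
      collect_feedback(cases) -> dict of targeted notes for those cases
    """
    feedback, history = {}, []
    candidate = None
    for round_no in range(1, max_rounds + 1):
        candidate = run_optimizer(feedback)
        score, failing = evaluate(candidate)
        history.append((round_no, score))
        if score >= target or not failing:
            break  # good enough to deploy, or nothing left to analyse
        feedback.update(collect_feedback(failing))
    return candidate, history

# Toy stand-ins so the loop can be run end to end.
_scores = iter([0.81, 0.90])
candidate, history = optimize_until_good(
    run_optimizer=lambda fb: f"prompt with {len(fb)} targeted notes",
    evaluate=lambda cand: (next(_scores), ["case_a"]),
    collect_feedback=lambda cases: {c: "reviewer note" for c in cases},
    target=0.90,
)
```

Keeping the stop condition explicit (`target`, `max_rounds`) makes the "stop when metrics are acceptable" decision something that is agreed up front rather than judged run by run.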

**Section 7: Production considerations**
1. DSPy's integration with MLflow is a straightforward option for monitoring & versioning (link to article).
2. When to add a failure case to the evaluation dataset is a business call.
---


### this article is for engineering leaders or practitioners

## Article: How I Unlocked Tremendous Value by Automating Prompt Engineering at Enterprise Scale

## **Section 1: The Problem**

I'd built plenty of successful POCs with LLM prompts, and they always looked impressive in demos. But when it came time to deploy at scale, everything fell apart. Manual prompt engineering is fundamentally non-deterministic: what works today might fail tomorrow, and there's no systematic way to improve it.

My challenge was unlocking value from years of unstructured customer interaction data. I'd spent three weeks fine-tuning a single prompt by hand, iterating through variations, hoping to stumble on something that worked consistently. It wasn't cutting it.

**Then it hit me:** LLMs are remarkably good at *giving feedback*. They can look at their own outputs and tell you exactly what went wrong. So why couldn't they convert that feedback into better prompts automatically? Surely something existed to solve this.

I searched for tools that could close this loop automatically[^1]. That's when I stumbled upon [DSPy](https://thedataquarry.com/blog/learning-dspy-1-the-power-of-good-abstractions/) and the concept of prompt optimizers: systems that treat prompt engineering as an optimization problem rather than an art. It clicked immediately. When [GEPA](https://dspy.ai/tutorials/gepa_aime/) was released, I knew I had to test it.

Think of it like compilation: you write high-level code (your task definition), and the compiler transforms it into optimized machine instructions (a battle-tested prompt). You don't hand-tune assembly, so why hand-tune prompts?

## **Section 2: The Pilot**

I needed a simple test to see if GEPA could work for my use case. So I created a synthetic dataset of 27 sales call transcripts that represented a real challenge we face: **detecting the presence of required behaviors and predicting call quality (good/bad)**. The transcripts were hand-labeled across 7 behavior categories `(introduction, needs, value proposition, objection handling, benefit reinforcement, risk reduction, and closing)`. The dataset was small enough to iterate fast and realistic enough to validate the approach. It was also representative of a problem I'd hit repeatedly: intent extraction and call evaluation look easy for a few cases, but precision and recall tank at scale.

I expected weeks of iteration. Instead, I got meaningful results in a single run.
Usually a showpiece like this is a carefully selected sample chosen to flatter the approach. This was literally my first attempt, which in itself says something about the power of the approach.

### Results

| Approach | Cost | Time | Accuracy |
|----------|------|------|----------|
| Manual prompt engineering | $100-1000 (engineer time) | Days to weeks | 72% |
| **GEPA** | ~$2 | 10 hours | 81% |
| GEPA with error analysis | ~$0.5 | 3 hours | 90% |

The optimizer ran for about 10 hours, cost roughly $2, and explored over 200 prompt variants. Through genetic mutation and Pareto selection, it whittled those down to 9 "survivors": prompts that excelled at different subsets of the problem. The best performer jumped from 72% to 81% accuracy, a lift I hadn't achieved in months of manual tuning.

What really convinced me wasn't just the final number. It was watching the intermediate prompts evolve. I could see the optimizer discovering nuances I'd never thought to include: explicit definitions for each category, step-by-step rules for edge cases, domain-specific guidance about soft pulls versus hard pulls. The quality of the reasoning it produced while iterating was genuinely impressive.

Hopefully I have convinced you that this method is powerful. Let's see how I did it, so you can follow similar steps for your own use case.

In [None]:
%%html
<div style="border: 2px solid #f4f4f4; padding: 15px; border-radius: 8px; background: #f2f2f2ff;">
  <h3 style="color: #1d1d1dff;">Sample Dataset</h3>
  <table style="width: 100%; border-collapse: collapse; font-size: 13px;">
    <thead>
      <tr style="background-color: #27ae60; color: white;">
        <th style="padding: 10px; text-align: left; border-radius: 4px 0 0 0;">transcript (Input)</th>
        <th style="padding: 10px; text-align: left;">Categories present (Output)</th>
        <th style="padding: 10px; text-align: left; border-radius: 0 4px 0 0;">Call Quality (Output)</th>
      </tr>
    </thead>
    <tbody>
      <tr style="background: white;">
        <td style="padding: 10px; border-bottom: 1px solid #ddd;">agent: Hi, good afternoon, this is Maya calling from American Express...</td>
        <td style="padding: 10px; border-bottom: 1px solid #ddd;">intro_rapport, needs_assessment, objection_handling, benefit_reinforcement, risk_reduction, call_to_action</td>
        <td style="padding: 10px; border-bottom: 1px solid #ddd; color: #27ae60; font-weight: bold;">good</td>
      </tr>
      <tr style="background: #f9f9f9;">
        <td style="padding: 10px; border-bottom: 1px solid #ddd;">agent: Hi! Good morning, this is Tyler calling from Amex...</td>
        <td style="padding: 10px; border-bottom: 1px solid #ddd;">intro_rapport, needs_assessment, value_prop, objection_handling</td>
        <td style="padding: 10px; border-bottom: 1px solid #ddd; color: #c0392b; font-weight: bold;">bad</td>
      </tr>
      <tr style="background: white;">
        <td style="padding: 10px;">agent: Hi, this is Mark from Amex, calling about our credit solutions...</td>
        <td style="padding: 10px;">value_prop, call_to_action</td>
        <td style="padding: 10px; color: #c0392b; font-weight: bold;">bad</td>
      </tr>
    </tbody>
  </table>
</div>



In [None]:
%%html
<div style="display:flex; gap:20px; font-family:system-ui,-apple-system,sans-serif;">
  <div style="flex:1; background:#f8f9fa; border-radius:12px; padding:20px; border:1px solid #e0e0e0;">
    <h3 style="margin:0 0 12px 0; color:#1a73e8; font-size:14px; text-transform:uppercase; letter-spacing:1px;">📞 Input: Call Transcript</h3>
    <div style="background:white; padding:16px; border-radius:8px; font-size:13px; line-height:1.6; max-height:300px; overflow-y:auto; white-space:pre-wrap; color:#333;">agent: Hi, good afternoon, this is Maya calling from American Express. Am I speaking with Jordan Lee?

customer: Yes, this is Jordan.

agent: Great, Jordan. How are you doing today?

customer: I'm good, thanks. Busy afternoon, but I have a few minutes.

agent: I appreciate you taking the time...</div>
  </div>
  <div style="flex:1; background:#f0f7f0; border-radius:12px; padding:20px; border:1px solid #c8e6c9;">
    <h3 style="margin:0 0 12px 0; color:#2e7d32; font-size:14px; text-transform:uppercase; letter-spacing:1px;">📊 Output: Analysis</h3>
    <div style="background:white; padding:16px; border-radius:8px; margin-bottom:12px;">
      <div style="font-size:12px; color:#666; margin-bottom:4px;">Call Quality</div>
      <div style="font-size:24px; font-weight:600; color:#2e7d32;">✓ Good</div>
    </div>
    <div style="background:white; padding:16px; border-radius:8px;">
      <div style="font-size:12px; color:#666; margin-bottom:8px;">Detected Categories</div>
      <div style="display:flex; flex-wrap:wrap; gap:6px;">
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Introduction/Rapport ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Need Assessment ✓</span>
        <span style="background:#ffebee; color:#c62828; padding:4px 10px; border-radius:16px; font-size:12px;">Value Proposition ✗</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Objection Handling ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Benefit Reinforcement ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Risk Reduction ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Call to Action ✓</span>
      </div>
    </div>
  </div>
</div>

**Section 3: The Architecture**

Let me walk you through how I actually set this up. At its core, GEPA follows a simple loop: generate prompt variants → evaluate them → select the best survivors → repeat. But the magic is in the details.

**The Setup**

I used DSPy's `ChainOfThought` module to define my task: given a call transcript, output (1) which behavioral categories the agent demonstrated, and (2) whether the call outcome was good or bad. My initial prompt was embarrassingly simple: just two lines describing what I wanted. The model knew *what* to do but had no guidance on *how* to do it well.

**How GEPA Works**

The optimizer runs in iterations. Each cycle:
1. **Genetic Mutation**: The LLM looks at failing cases and their feedback, then proposes improved instructions. It's not random mutation; it's *reflective* mutation, guided by what went wrong.
2. **Evaluation Harness**: Each candidate prompt runs against the validation set and gets scored.
3. **Pareto Selection**: Here's the clever bit: GEPA doesn't just keep the single best prompt. It maintains a "Pareto front" of diverse specialists. One prompt might excel at detecting objection handling; another might be better at call outcome prediction. This diversity prevents the optimizer from over-fitting to one pattern and forgetting others.
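
The mutate, evaluate, select cycle above can be sketched as a toy loop. This is a deliberately simplified illustration and not GEPA's implementation: `evolve` is a hypothetical name, the parent is always the current best rather than sampled from a Pareto front, selection is plain top-k, and `mutate` and `score` are stand-ins for the LLM proposal step and the validation-set metric.

```python
def evolve(seed_prompt, mutate, score, rounds=3, pool_size=4):
    """Toy mutate -> evaluate -> select loop over prompt strings."""
    pool = [seed_prompt]
    for _ in range(rounds):
        parent = pool[0]                                   # best candidate so far
        child = mutate(parent)                             # propose a revised prompt
        candidates = list(dict.fromkeys(pool + [child]))   # drop duplicates, keep order
        pool = sorted(candidates, key=score, reverse=True)[:pool_size]
    return pool

# Stand-ins: "mutation" appends a rule, and "score" is just prompt length.
pool = evolve(
    "Classify the call.",
    mutate=lambda prompt: prompt + " Add a rule.",
    score=len,
)
```

In a real run, `mutate` would call an LLM with the failing cases and their feedback, and `score` would run the candidate over the validation set.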

**What I Observed**

My 2-line prompt evolved into a ~1,500 word instruction set. The optimizer discovered things I'd never explicitly taught it: precise definitions for each category, rules for ambiguous cases ("a bare greeting isn't rapport-building; look for warmth and time acknowledgment"), and domain-specific guidance about soft pulls and secure application links.

The result: 72% → 81% accuracy. More importantly, the process was repeatable. I wasn't guessing anymore; I had a system.

For a deeper technical dive, I wrote up the internals here: [GEPA Deepdive](https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html)

In [None]:
final_prompt = """New Instructions for Analyzing Banking/Card Transaction Call Transcripts

Overview
You are an analysis assistant whose job is to evaluate sales/transaction-focused call transcripts in the banking/credit-card domain. For each transcript, produce a compact, structured analysis with two main objectives:
  (a) identify the agent behavior categories demonstrated (from the seven pillars below), and
  (b) judge the likely business outcome of the call (conversion, retention, or mixed).

Inputs you will receive
- A complete transcript of a single call between an agent and a customer. Transcripts may include labels such as "agent:" and "customer:" and may cover topics like card offers, fees, rewards, security, and next steps.

What you must produce (three sections exactly)
1) reasoning
   - Provide a concise, bullets-style justification for every pillar category you detected in the transcript.
   - Include short quotes or paraphrases from the transcript to illustrate why the category applies. Do not introduce facts or assumptions beyond what is in the transcript.
   - If you detect a strength/weakness signal about the outcome, include a brief, one- to two-sentence note here describing how strong the signal is and what would push it toward conversion or toward retention.
   - This section may contain a small, optional note about outcome strength, but must not introduce information outside the transcript.

2) categories
   - Output a Python-like list of the detected pillar categories in the exact order they first appeared in the transcript.
   - Example format: ['introduction_rapport_building', 'need_assessment_qualification', ...]

3) final_result
   - A single word indicating the likely business outcome:
     - conversion: the call is progressing toward an immediate or near-term application/upgrade/activation.
     - retention: the call focuses on keeping an existing customer, avoiding churn, or upselling within retention without an immediate conversion.
     - mixed: signals of both retention and conversion, or the outcome is uncertain and depends on future steps.
   - Do not add any qualifiers in this field; use exactly one of the three keywords above.

Optional but encouraged: assess the strength of the outcome
- If you include it (recommendation), place this assessment only in reasoning as the optional strength_of_outcome note. Keep it concise (one or two sentences). It should address:
  - How strong is the conversion/retention signal?
  - What would most likely push the outcome toward conversion, or toward retention?

Pillar definitions (seven bank/card-specific categories)
- introduction_rapport_building
  - Includes opening greetings, courtesy, acknowledgment of time, and attempts to establish rapport.
  - Examples: greetings, confirming time, polite introductions, small talk about fit or time constraints.
- need_assessment_qualification
  - Involves asking about customer needs, usage, spend patterns, eligibility checks, and whether the product fits (e.g., business vs personal, employee cards, annual fees, soft vs hard pulls).
- value_proposition_feature_mapping
  - Linking card features to tangible, customer-relevant benefits (rewards, protections, credits) and showing how those features align with stated needs.
- objection_handling
  - Addressing concerns about price, complexity, trust, enrollment, or process obstacles. Includes acknowledging concerns and offering clarifications or mitigations.
- benefit_reinforcement
  - Reiterating concrete benefits and value after objections or hesitations, often tying back to the customer's stated needs.
- risk_reduction_trust_building
  - Providing security assurances, privacy protections, non-hard-pull options, guarantees, terms clarity, or brand trust signals.
- call_to_action_closing
  - Concrete next steps or commitments: soft checks, secure links, email/mail options, scheduling follow-ups, or instructions to apply/get more information.

How to apply the rules
- For every transcript, read from start to finish. Mark each pillar as soon as its criteria are clearly demonstrated.
- If a single utterance clearly satisfies more than one pillar, count it under all applicable pillars.
- If a pillar is not clearly demonstrated anywhere in the transcript, do not include it in the categories list.
- Record the detected pillars in the exact order of their first appearance in the transcript.
- The final_result should reflect the overall trajectory of the call as described above.

Output constraints and format
- Do not introduce any facts not present in the transcript.
- Do not insert subjective opinions beyond what is grounded in the transcript.
- Use the exact section headings and formatting:
  reasoning
  categories
  final_result
- Do not include extraneous content beyond the three sections above.

Domain-specific considerations
- You may encounter references to soft pulls vs hard pulls, online applications, secure links, email follow-ups, or scheduled follow-ups. Treat these as legitimate "call_to_action_closing" or "risk_reduction_trust_building" elements as appropriate.
- When quoting or paraphrasing, keep quotes brief and focused on the reason for the pillar.
- If PII appears in the transcript (e.g., partial SSN or addresses), quote minimally and do not reveal full sensitive data in your justification. You may paraphrase or reference the presence of sensitive data without reproducing it.

Example behavior (not to reproduce here)
- A transcript with strong, explicit next steps to apply or pre-qualify is more conversion-oriented; a transcript focusing solely on questions and reassurance without a clear next step is more retention-oriented or mixed.

End result
- Return exactly three sections for every transcript analyzed, with the content governed by the rules above. This format enables consistent, comparable, and transparent analysis across transcripts."""

## **Section 3: The Architecture**

We'll use DSPy to run GEPA.

> *If you're new to DSPy: it's a framework that treats prompts as code you can optimize programmatically, rather than strings you tweak by hand. For a deeper technical introduction, see [The Data Quarry's guide](https://thedataquarry.com/blog/learning-dspy-3-working-with-optimizers/).*

### Prerequisites for Optimization

Before running any optimizer, you need three things:

| Component | What it does | Why it matters |
|-----------|--------------|----------------|
| **DSPy Signature (The Prompt)** | Your baseline module defining the task | The "thing" being optimized |
| **Metric Function** | Returns score and textual feedback | Tells optimizer what "good" looks like |
| **Dataset** | Labeled examples (train/val/test) | Ground truth for evaluation |

The key insight: **optimizers need textual feedback, not just scores.** If you only return a number, the optimizer is flying blind. If you return *why* something failed, it can propose smarter mutations.

### The DSPy Setup

**Code Walkthrough**

I used DSPy's `ChainOfThought` module to define my task: given a call transcript, output (1) which behavioral categories the agent demonstrated, and (2) whether the call outcome was good or bad. My initial prompt was embarrassingly simple, just two lines describing what I wanted:

```python
from typing import List, Literal

import dspy

class CallAnalysis(dspy.Signature):
    """
    Read the provided call transcript and analyze it comprehensively.
    Determine both: (1) which categories the agent displayed, and 
    (2) whether the call will lead to conversion or customer retention.
    """
    message: str = dspy.InputField()
    categories: List[Literal["introduction_rapport_building", "need_assessment_qualification", 
                             "value_proposition_feature_mapping", "objection_handling", 
                             "benefit_reinforcement", "risk_reduction_trust_building", 
                             "call_to_action_closing"]] = dspy.OutputField()
    final_result: Literal['good', 'bad'] = dspy.OutputField()

program = dspy.ChainOfThought(CallAnalysis)
```

The model knew *what* to do but had no guidance on *how* to do it well.

**Metrics for success**

```python
def call_qual_metric(gold, pred):
    return 1.0 if gold == pred else 0.0

def category_qual_metric(gold, pred):
    """Compute score for categories using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correct = len(gold_true & pred_set) + len(gold_false - pred_set)
    return correct / len(gold)

def comb_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Overall metric combining both scores."""
    call_qual = call_qual_metric(gold.final_result, pred.final_result)
    category_qual = category_qual_metric(gold.categories, pred.categories)
    return (call_qual + category_qual) / 2
```
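
To make the arithmetic concrete, here is a small worked example. It is a sketch only: `category_score` is a compact restatement of the set-based category metric above (equivalent as long as predictions contain only known category names), and the labels are illustrative, loosely modeled on a call where a bare greeting was wrongly counted as rapport-building.

```python
def category_score(gold, pred):
    """Fraction of categories whose presence or absence was predicted correctly."""
    pred_set = set(pred)
    hits = sum(1 for name, present in gold.items() if present == (name in pred_set))
    return hits / len(gold)

# Gold labels: category name -> whether the agent demonstrated it (illustrative).
gold_categories = {
    "introduction_rapport_building": False,
    "need_assessment_qualification": False,
    "value_proposition_feature_mapping": True,
    "objection_handling": False,
    "benefit_reinforcement": False,
    "risk_reduction_trust_building": False,
    "call_to_action_closing": True,
}
predicted = [
    "introduction_rapport_building",      # false positive
    "value_proposition_feature_mapping",
    "call_to_action_closing",
]

category_part = category_score(gold_categories, predicted)  # 6 of 7 categories correct
outcome_part = 1.0                                           # predicted "bad", gold "bad"
combined = (category_part + outcome_part) / 2                # (6/7 + 1) / 2 = 13/14
```

A single false positive costs 1/7 on the category half, so the combined score for this call is 13/14, about 0.93, even though the call outcome was predicted correctly.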

### Adding in GEPA

This is the key enabler for GEPA's reflective mutation: returning *textual feedback*, not just a score.

![A simplified GEPA flow diagram](gepa_flow_simplified.png)

The insight: when GEPA sees a failure, it doesn't just know "this was wrong". It knows *why* and can propose smarter mutations.


```python
def call_qual_feedback(gold, pred):
    """ Generate feedback for final result module. """
    if gold == pred:
        fb = f"You correctly classified the sales call as `{gold}`. This sales call is indeed `{gold}`."
    else:
        fb = f"You incorrectly classified the sales call as `{pred}`. The correct sales call is `{gold}`. Think about how you could have reasoned to get the correct sales call label."
    return fb

def category_qual_feedback(gold, pred):
    """Generate feedback using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correctly_included = gold_true & pred_set
    incorrectly_included = gold_false & pred_set
    incorrectly_excluded = gold_true - pred_set
    correctly_excluded = gold_false - pred_set
    
    score = (len(correctly_included) + len(correctly_excluded)) / len(gold)
    
    if score == 1.0:
        # Return only the text so the caller can always concatenate strings.
        return f"Perfect. Correctly identified: `{correctly_included}`."
    
    fb = f"Correctly identified: `{correctly_included}`.\n"
    if incorrectly_included:
        fb += f"False positives: `{incorrectly_included}`.\n"
    if incorrectly_excluded:
        fb += f"Missed: `{incorrectly_excluded}`.\n"
    return fb

def comb_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Computes a score and provides feedback for the call analysis prediction.
    Always returns a dspy.Prediction carrying both the score and the feedback text.
    """
    # Compute feedback and scores
    cal_fb = call_qual_feedback(gold.final_result, pred.final_result)
    cat_fb = category_qual_feedback(gold.categories, pred.categories)
    fb = cal_fb + '\n' + cat_fb
    score = comb_metric(gold, pred)
    return dspy.Prediction(score=score, feedback=fb)
```

### How GEPA Works Under the Hood

Before running the optimizer, let's understand what GEPA actually does:

1. **Reflective Mutation**: Unlike random genetic mutation, GEPA's LLM *reads the failure feedback* and proposes targeted improvements. It's not guessing; it's reasoning about what went wrong.

2. **Pareto Selection**: Instead of keeping only the single best prompt, GEPA maintains a "frontier" of diverse specialists. One prompt might excel at detecting objection handling; another at predicting call outcomes. This prevents catastrophic forgetting.

3. **Text-as-Feedback**: Traditional RL uses scalar rewards. GEPA exploits rich textual feedback ("You incorrectly marked this as rapport-building because...") to guide mutations more precisely.
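
The Pareto idea in step 2 can be illustrated with a few lines of code. This is a simplified sketch under stated assumptions, not GEPA's actual selection code: `pareto_front` is a hypothetical helper that keeps every candidate achieving the best score on at least one validation example, and the score table is toy data. The real optimizer does more than this.

```python
def pareto_front(scores):
    """Return candidates that achieve the best score on at least one example.

    `scores` maps candidate name -> list of per-example scores (same example order).
    """
    n_examples = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n_examples)]
    return [
        name
        for name, s in scores.items()
        if any(s[i] == best[i] for i in range(n_examples))
    ]

# Toy per-example scores for three candidate prompts.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],   # wins on examples 0 and 2
    "prompt_b": [0.0, 1.0, 1.0],   # wins on examples 1 and 2
    "prompt_c": [0.0, 0.0, 0.5],   # never the best on any example
}
survivors = pareto_front(scores)
```

Here `prompt_a` and `prompt_b` have the same average, so a "keep the single best" rule would discard one of them arbitrarily; keeping both preserves the specialist for each example.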

### Running GEPA

With prerequisites in place, the optimizer setup is straightforward:

```python
from dspy import GEPA

optimizer = GEPA(
    metric=comb_metric_with_feedback,  # Returns score + textual feedback for failures
    auto="light",                      # Budget setting (use "heavy" for production)
    # GEPA also needs a reflection LM, passed as reflection_lm=<a dspy.LM instance>.
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)
```
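
After compiling, the baseline and the optimized program should be compared on held-out data. The sketch below shows the idea with plain Python so it runs without any model calls; `average_metric`, the two stub programs, and the two-example test set are all hypothetical stand-ins. DSPy also ships an `Evaluate` utility that can be used for this purpose.

```python
def average_metric(program, examples, metric):
    """Mean metric score of `program` over `examples`."""
    scores = [metric(ex, program(message=ex["message"])) for ex in examples]
    return sum(scores) / len(scores)

def outcome_metric(gold, pred):
    """1.0 when the predicted call outcome matches the label, else 0.0."""
    return 1.0 if gold["final_result"] == pred["final_result"] else 0.0

# Toy held-out examples (not real transcripts).
test_set = [
    {"message": "customer: sounds great, send the link", "final_result": "good"},
    {"message": "customer: I'm really not sure", "final_result": "bad"},
]

def baseline(message):
    """Stub program that always predicts a good call."""
    return {"final_result": "good"}

def optimized(message):
    """Stub program that reacts to customer hesitation."""
    return {"final_result": "bad" if "not sure" in message else "good"}

baseline_score = average_metric(baseline, test_set, outcome_metric)
optimized_score = average_metric(optimized, test_set, outcome_metric)
```

Reporting both numbers on the same test split keeps the comparison honest, since the validation set was already used by the optimizer to pick candidates.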

### What I Observed

My 2-line prompt evolved into a ~1,500 word instruction set. The optimizer discovered things I'd never explicitly taught it: precise definitions for each category, rules for ambiguous cases ("a bare greeting isn't rapport-building; look for warmth and time acknowledgment"), and domain-specific guidance about soft pulls and secure application links.

The result: 72% → 81% accuracy. More importantly, the process was repeatable. I wasn't guessing anymore; I had a system.

The optimizer discovered nuances I'd never thought to include, like "a bare greeting isn't rapport-building; look for warmth and time acknowledgment."

For a deeper technical dive, I wrote up the internals here: [GEPA Deepdive](https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html)

## **Section 3: Detailed Implementation of GEPA**

Let's first understand what GEPA does, then dive into the code for this specific use case. If you want a deeper dive into how GEPA works, I have previously written a detailed piece here: [GEPA](https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html)

For those who prefer to dive straight into code, here's the complete notebook: [github link](https://github.com/risheekkumarb/gepa-deepdive)

>We'll use DSPy to run GEPA. If you're new to DSPy, it's a framework that treats prompts as code you can optimize programmatically. For background, see [The Data Quarry's guide](https://thedataquarry.com/blog/learning-dspy-3-working-with-optimizers/).

---

### How GEPA Works

GEPA (Genetic-Pareto) differs from traditional optimization in three key ways:

1. **Reflective Mutation**: The LLM *reads failure feedback* and proposes targeted improvements. It's not random guessing; it's reasoning about what went wrong.

2. **Pareto Selection**: Instead of keeping only the single best prompt, GEPA maintains a "frontier" of diverse specialists. One prompt might excel at detecting objection handling; another at predicting outcomes. This prevents catastrophic forgetting.

3. **Text-as-Feedback**: Traditional RL uses scalar rewards. GEPA exploits rich textual feedback ("You incorrectly marked this as rapport-building because...") to guide mutations precisely.

---

### Prerequisites

To use GEPA, we need three components.

| Component | What it does | Why it matters |
|-----------|--------------|----------------|
| **DSPy Signature** | Your baseline prompt defining the task | The "prompt" being optimized |
| **Metric & Feedback** | Returns score + textual feedback | Tells optimizer what "good" looks like *and why* |
| **Dataset** | Labeled examples (train/val/test) | Ground truth for evaluation |
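
For the dataset row, a reproducible split might look like the sketch below. The split sizes are an assumption (9/9/9 across the 27 transcripts) because they are not stated here, `split_dataset` is a hypothetical helper, and the records are placeholders rather than real transcripts.

```python
import random

def split_dataset(examples, n_train, n_val, seed=0):
    """Shuffle once with a fixed seed, then slice into train/val/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 27 placeholder records standing in for the labeled transcripts.
records = [
    {"message": f"transcript {i}", "final_result": "good" if i % 2 else "bad"}
    for i in range(27)
]
train_set, val_set, test_set = split_dataset(records, n_train=9, n_val=9)
```

When working with DSPy, each record would typically be wrapped as `dspy.Example(**record).with_inputs("message")` before being handed to the optimizer. Fixing the seed keeps the test split stable across optimizer runs, so accuracy numbers remain comparable.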

---

#### The DSPy Signature

In DSPy, a Signature defines the input/output schema, and the instructions in its docstring become part of the prompt. My initial prompt was embarrassingly simple, just two lines:

```python
from typing import List, Literal

import dspy


class CallAnalysis(dspy.Signature):
    """
    Read the provided call transcript and analyze it comprehensively.
    Determine both: (1) which categories the agent displayed, and 
    (2) whether the call will lead to conversion or customer retention.
    """
    message: str = dspy.InputField()
    categories: List[Literal["introduction_rapport_building", "need_assessment_qualification", 
                             "value_proposition_feature_mapping", "objection_handling", 
                             "benefit_reinforcement", "risk_reduction_trust_building", 
                             "call_to_action_closing"]] = dspy.OutputField()
    final_result: Literal['good', 'bad'] = dspy.OutputField()

program = dspy.ChainOfThought(CallAnalysis)
```

---

#### Metric Function

A metric tells us whether we are moving in the right direction. In this case, two things mattered: the accuracy of the detected categories and the final prediction of whether the call was good or bad.
The metric is therefore the mean of the two scores.

```python
def call_qual_metric(gold, pred):
    return 1.0 if gold == pred else 0.0

def category_qual_metric(gold, pred):
    """Compute score for categories using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correct = len(gold_true & pred_set) + len(gold_false - pred_set)
    return correct / len(gold)

def comb_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Overall metric combining both scores."""
    call_qual = call_qual_metric(gold.final_result, pred.final_result)
    category_qual = category_qual_metric(gold.categories, pred.categories)
    return (call_qual + category_qual) / 2
```
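
To make the averaging concrete, here is a small worked example. The labels and prediction are made up for illustration, only four categories are used to keep it short, and `category_score` is a hypothetical, equivalent restatement of the category metric above.

```python
def category_score(gold: dict, pred: list) -> float:
    """Fraction of categories whose presence or absence was predicted correctly."""
    pred_set = set(pred)
    correct = sum(1 for name, present in gold.items() if present == (name in pred_set))
    return correct / len(gold)


# Made-up gold labels: True means the agent displayed that behavior.
gold_categories = {
    "introduction_rapport_building": False,
    "objection_handling": True,
    "benefit_reinforcement": False,
    "call_to_action_closing": True,
}
# Made-up prediction: one false positive (introduction_rapport_building).
predicted = ["introduction_rapport_building", "objection_handling", "call_to_action_closing"]

cat_score = category_score(gold_categories, predicted)  # 3 of 4 categories correct

gold_result, pred_result = "bad", "good"
call_score = 1.0 if gold_result == pred_result else 0.0  # outcome was wrong

combined = (cat_score + call_score) / 2
print(cat_score, call_score, combined)
```

Because the two parts are weighted equally, a wrong outcome label caps the combined score at 0.5 no matter how well the categories were detected.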

---

#### Adding Feedback

This is the key enabler. A basic metric only returns a score: the optimizer knows "0.7" but not *why* it failed. With feedback, the optimizer can reason about failures and propose targeted fixes.

```python
def call_qual_feedback(gold, pred):
    """ Generate feedback for final result module. """
    if gold == pred:
        fb = f"You correctly classified the sales call as `{gold}`. This sales call is indeed `{gold}`."
    else:
        fb = f"You incorrectly classified the sales call as `{pred}`. The correct sales call is `{gold}`. Think about how you could have reasoned to get the correct sales call label."
    return fb

def category_qual_feedback(gold, pred):
    """Generate feedback using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correctly_included = gold_true & pred_set
    incorrectly_included = gold_false & pred_set
    incorrectly_excluded = gold_true - pred_set
    correctly_excluded = gold_false - pred_set
    
    score = (len(correctly_included) + len(correctly_excluded)) / len(gold)
    
    if score == 1.0:
        # Return only the feedback string so both branches have the same return type
        return f"Perfect. Correctly identified: `{correctly_included}`."
    
    fb = f"Correctly identified: `{correctly_included}`.\n"
    if incorrectly_included:
        fb += f"False positives: `{incorrectly_included}`.\n"
    if incorrectly_excluded:
        fb += f"Missed: `{incorrectly_excluded}`.\n"
    return fb

def comb_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Compute the combined score and textual feedback for a call-analysis prediction.
    Returns a dspy.Prediction carrying both `score` and `feedback`.
    """
    # Compute feedback and scores
    cal_fb = call_qual_feedback(gold.final_result, pred.final_result)
    cat_fb = category_qual_feedback(gold.categories, pred.categories)
    fb = cal_fb + '\n' + cat_fb
    score = comb_metric(gold, pred)
    return dspy.Prediction(score=score, feedback=fb)
```
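
To show what the optimizer actually gets to read, here is a self-contained sketch that produces the category feedback for one made-up example. `category_feedback` is a hypothetical variant of the function above; it prints sorted lists instead of sets so the text is stable from run to run.

```python
def category_feedback(gold: dict, pred: list) -> str:
    """Describe which categories were right, wrongly added, or missed."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = set(gold) - gold_true

    lines = [f"Correctly identified: {sorted(gold_true & pred_set)}."]
    false_positives = sorted(gold_false & pred_set)
    missed = sorted(gold_true - pred_set)
    if false_positives:
        lines.append(f"False positives: {false_positives}.")
    if missed:
        lines.append(f"Missed: {missed}.")
    return "\n".join(lines)


# Made-up labels and prediction for illustration.
gold = {
    "introduction_rapport_building": False,
    "objection_handling": True,
    "call_to_action_closing": True,
}
pred = ["introduction_rapport_building", "call_to_action_closing"]

feedback = category_feedback(gold, pred)
print(feedback)
```

The resulting text names the over-detected category and the missed one explicitly, which is the kind of signal a reflective optimizer can act on.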

---

### Running GEPA

With prerequisites in place, optimization is straightforward:

```python
from dspy import GEPA

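optimizer = GEPA(
    metric=comb_metric_with_feedback,  # returns score + textual feedback
    auto="light",                      # preset budget: "light", "medium", or "heavy"
    # Note: GEPA also accepts reflection_lm (a dspy.LM), the model that reads
    # the feedback and proposes new instructions; depending on your DSPy
    # version it may be required.
)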

optimized_program = optimizer.compile(program, trainset=tset, valset=vset)
```
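
The `tset` and `vset` above are assumed to be lists of labeled examples. As a minimal sketch of how such splits could be produced, the helper below shuffles deterministically and cuts the data into train/val/test. The function name, split fractions, and placeholder rows are all hypothetical.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle with a fixed seed, then cut into train/val/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


# Placeholder rows; real rows would hold a transcript plus its labels.
rows = [
    {"message": f"transcript {i}", "final_result": "good" if i % 2 else "bad"}
    for i in range(10)
]

tset, vset, test_set = split_dataset(rows)
print(len(tset), len(vset), len(test_set))
```

In a DSPy program, each row would then typically be wrapped as `dspy.Example(**row).with_inputs("message")` before being passed to `compile`. Fixing the seed keeps the splits identical across optimization runs, so score changes reflect the prompt rather than the data.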

---

### Post GEPA run

GEPA ran for 10 hrs on my PC. I saw the 2-line prompt evolve into ~1,500 words of instruction, discovering nuances I'd never thought to include, like "a bare greeting isn't rapport-building; look for warmth and time acknowledgment."

<!-- INSERT: initial vs optimized prompt HTML box here -->

Result: **72% → 81% accuracy**. More importantly, the process was repeatable.

For a deeper technical dive: [GEPA Deepdive](https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html)

In [None]:
initial_prompt = """Read the provided call transcript and analyze it comprehensively.
Determine both: (1) which categories the agent displayed, and (2) whether the call will lead to conversion or customer retention."""

final_prompt = """New Instructions for Analyzing Banking/Card Transaction Call Transcripts

Overview
You are an analysis assistant whose job is to evaluate sales/transaction-focused call transcripts in the banking/credit-card domain. For each transcript, produce a compact, structured analysis with two main objectives:
  (a) identify the agent behavior categories demonstrated (from the seven pillars below), and
  (b) judge the likely business outcome of the call (conversion, retention, or mixed).

Inputs you will receive
- A complete transcript of a single call between an agent and a customer. Transcripts may include labels such as "agent:" and "customer:" and may cover topics like card offers, fees, rewards, security, and next steps.

What you must produce (three sections exactly)
1) reasoning
   - Provide a concise, bullets-style justification for every pillar category you detected in the transcript.
   - Include short quotes or paraphrases from the transcript to illustrate why the category applies. Do not introduce facts or assumptions beyond what is in the transcript.
   - If you detect a strength/weakness signal about the outcome, include a brief, one- to two-sentence note here describing how strong the signal is and what would push it toward conversion or toward retention.
   - This section may contain a small, optional note about outcome strength, but must not introduce information outside the transcript.

2) categories
   - Output a Python-like list of the detected pillar categories in the exact order they first appeared in the transcript.
   - Example format: ['introduction_rapport_building', 'need_assessment_qualification', ...]

3) final_result
   - A single word indicating the likely business outcome:
     - conversion: the call is progressing toward an immediate or near-term application/upgrade/activation.
     - retention: the call focuses on keeping an existing customer, avoiding churn, or upselling within retention without an immediate conversion.
     - mixed: signals of both retention and conversion, or the outcome is uncertain and depends on future steps.
   - Do not add any qualifiers in this field; use exactly one of the three keywords above.

Optional but encouraged: assess the strength of the outcome
- If you include it (recommendation), place this assessment only in reasoning as the optional strength_of_outcome note. Keep it concise (one or two sentences). It should address:
  - How strong is the conversion/retention signal?
  - What would most likely push the outcome toward conversion, or toward retention?

Pillar definitions (seven bank/card-specific categories)
- introduction_rapport_building
  - Includes opening greetings, courtesy, acknowledgment of time, and attempts to establish rapport.
  - Examples: greetings, confirming time, polite introductions, small talk about fit or time constraints.
- need_assessment_qualification
  - Involves asking about customer needs, usage, spend patterns, eligibility checks, and whether the product fits (e.g., business vs personal, employee cards, annual fees, soft vs hard pulls).
- value_proposition_feature_mapping
  - Linking card features to tangible, customer-relevant benefits (rewards, protections, credits) and showing how those features align with stated needs.
- objection_handling
  - Addressing concerns about price, complexity, trust, enrollment, or process obstacles. Includes acknowledging concerns and offering clarifications or mitigations.
- benefit_reinforcement
  - Reiterating concrete benefits and value after objections or hesitations, often tying back to the customer's stated needs.
- risk_reduction_trust_building
  - Providing security assurances, privacy protections, non-hard-pull options, guarantees, terms clarity, or brand trust signals.
- call_to_action_closing
  - Concrete next steps or commitments: soft checks, secure links, email/mail options, scheduling follow-ups, or instructions to apply/get more information.

How to apply the rules
- For every transcript, read from start to finish. Mark each pillar as soon as its criteria are clearly demonstrated.
- If a single utterance clearly satisfies more than one pillar, count it under all applicable pillars.
- If a pillar is not clearly demonstrated anywhere in the transcript, do not include it in the categories list.
- Record the detected pillars in the exact order of their first appearance in the transcript.
- The final_result should reflect the overall trajectory of the call as described above.

Output constraints and format
- Do not introduce any facts not present in the transcript.
- Do not insert subjective opinions beyond what is grounded in the transcript.
- Use the exact section headings and formatting:
  reasoning
  categories
  final_result
- Do not include extraneous content beyond the three sections above.

Domain-specific considerations
- You may encounter references to soft pulls vs hard pulls, online applications, secure links, email follow-ups, or scheduled follow-ups. Treat these as legitimate "call_to_action_closing" or "risk_reduction_trust_building" elements as appropriate.
- When quoting or paraphrasing, keep quotes brief and focused on the reason for the pillar.
- If PII appears in the transcript (e.g., partial SSN or addresses), quote minimally and do not reveal full sensitive data in your justification. You may paraphrase or reference the presence of sensitive data without reproducing it.

Example behavior (not to reproduce here)
- A transcript with strong, explicit next steps to apply or pre-qualify is more conversion-oriented; a transcript focusing solely on questions and reassurance without a clear next step is more retention-oriented or mixed.

End result
- Return exactly three sections for every transcript analyzed, with the content governed by the rules above. This format enables consistent, comparable, and transparent analysis across transcripts."""

print(f"Initial prompt length: {len(initial_prompt)} chars")
print(f"Final prompt length: {len(final_prompt)} chars")

Initial prompt length: 195 chars
Final prompt length: 5902 chars


In [None]:
from IPython.display import HTML

initial_prompt = """Read the provided call transcript and analyze it comprehensively.
Determine both: (1) which categories the agent displayed, and (2) whether the call will lead to conversion or customer retention."""

html = f"""
<div style="display: flex; flex-wrap: wrap; gap: 20px;">
  <div style="flex: 1 1 300px; min-width: 280px; border: 2px solid #ccc; padding: 15px; border-radius: 8px; background: #f9f9f9;">
    <h3 style="color: #d35400; margin-top: 0;">Initial Prompt</h3>
    <pre style="white-space: pre-wrap; font-size: 12px;">{initial_prompt}</pre>
  </div>
  <div style="flex: 1 1 300px; min-width: 280px; border: 2px solid #27ae60; padding: 15px; border-radius: 8px; background: #f0fff0;">
    <h3 style="color: #27ae60; margin-top: 0;">Optimized Prompt (GEPA)</h3>
    <pre style="white-space: pre-wrap; font-size: 11px; max-height: 400px; overflow-y: auto;">{final_prompt}</pre>
  </div>
</div>
<p style="margin-top: 15px;"><b>Score improvement:</b> 72.1% → 81.4%</p>
"""
HTML(html)

In [None]:
from IPython.display import HTML

metric_code = """def comb_metric(example, pred):
    gold_cat = json.loads(example['answer'])
    gold_final = example['final_result']
    
    # Category score
    correct = sum(1 for k, v in gold_cat.items() 
                  if (v and k in pred.categories) or 
                     (not v and k not in pred.categories))
    cat_score = correct / len(gold_cat)
    
    # Final result score
    final_score = 1.0 if gold_final == pred.final_result else 0.0
    
    return (cat_score + final_score) / 2"""

metric_fb_code = """def comb_metric_with_feedback(example, pred, pred_name=None):
    # ... same scoring logic as above ...
    
    # Generate textual feedback
    if gold_final != pred.final_result:
        fb = f"Incorrect: predicted {pred.final_result}, actual {gold_final}"
    else:
        fb = f"Correct: {gold_final}"
    
    if incorrectly_included:
        fb += f"\\nFalse positives: {incorrectly_included}"
    if incorrectly_excluded:
        fb += f"\\nMissed categories: {incorrectly_excluded}"
    
    return dspy.Prediction(score=score, feedback=fb)"""

html = f"""
<style>
  .compare-container {{ display: flex; gap: 20px; flex-wrap: wrap; }}
  .compare-box {{ flex: 1; min-width: 280px; border: 2px solid #ccc; padding: 15px; border-radius: 8px; }}
  .compare-box.basic {{ background: #f9f9f9; border-color: #ccc; }}
  .compare-box.feedback {{ background: #f0fff0; border-color: #27ae60; }}
  .compare-box h3 {{ margin-top: 0; }}
  .compare-box pre {{ white-space: pre-wrap; font-size: 11px; overflow-x: auto; }}
  @media (max-width: 600px) {{
    .compare-container {{ flex-direction: column; }}
    .compare-box {{ min-width: 100%; }}
  }}
</style>
<div class="compare-container">
  <div class="compare-box basic">
    <h3 style="color: #d35400;">❌ Metric (Score Only)</h3>
    <pre>{metric_code}</pre>
    <br>
    <p><b>Problem:</b> Optimizer only knows "0.7", with no idea <i>why</i> it failed.</p>
  </div>
  <div class="compare-box feedback">
    <h3 style="color: #27ae60;">✅ Metric with Feedback</h3>
    <pre>{metric_fb_code}</pre>
    <br>
    <p><b>Benefit:</b> Optimizer sees "Missed: intro_rapport" → can propose a targeted fix.</p>
  </div>
</div>
"""
HTML(html)

**Section 4: Using error analysis on top of GEPA**
* Results were promising (81% accuracy). I rolled the approach out internally, and it worked wherever it was implemented properly.
* The next challenge was pushing accuracy further. The failing cases were not especially hard; my hunch was that the LLM could get them right if given a hint.
* Problem: how to steer the automated optimization toward a prompt that handles the cases it was failing.

| Input (truncated) | Actual | Pred | Targeted feedback |
|---|---|---|---|
| Tyler calling from Amex... "sounds good, I'll send that link..." | bad | good | The customer showed hesitation ("I'm really not sure...") and the agent rushed to close without addressing concerns. A "bad" call lacks proper objection handling before closing. |
| Mark from Amex calling about credit solutions... | bad (intro_rapport=false) | bad (intro_rapport=true) | The agent said "Hi, this is Mark from Amex" without warmth, time acknowledgment, or rapport-building. A bare introduction doesn't qualify as introduction_rapport_building. |
| John from American Express... | bad (intro_rapport=false) | bad (intro_rapport=true) | Similar: the agent jumped straight into the pitch. Greeting alone isn't rapport-building; look for courtesy, time check, or warmth signals. |

The pattern: the model over-detects `introduction_rapport_building` (any greeting = rapport) and sometimes misses when a call is actually "bad" despite having next steps.
* Built a system in Excel that generates targeted feedback for failing cases
* Results improved from 81% to 90%
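
The notes above don't describe how the targeted feedback was wired into the metric, so the following is only one possible shape, sketched under assumptions: reviewer notes live in a lookup keyed by a hypothetical `example_id` (for instance, loaded from a spreadsheet export), and are appended to the automatically generated feedback.

```python
def add_targeted_feedback(base_feedback: str, example_id: str, notes: dict) -> str:
    """Append a reviewer's note for this example, if one exists."""
    note = notes.get(example_id)
    if not note:
        return base_feedback
    return f"{base_feedback}\nReviewer note: {note}"


# Hypothetical notes, e.g. read from a sheet with columns (example_id, note).
reviewer_notes = {
    "call_017": (
        "A bare introduction is not introduction_rapport_building; "
        "look for warmth or a time check."
    ),
}

with_note = add_targeted_feedback(
    "False positives: ['introduction_rapport_building'].", "call_017", reviewer_notes
)
without_note = add_targeted_feedback("Correct: bad", "call_042", reviewer_notes)
print(with_note)
```

Examples without a note keep their generic feedback unchanged, so the reviewer only has to write notes for the cases that actually fail.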

Rewrite a better version of Section 2 into a new message. Elaborate on key ideas for reader understanding. Use the first-person view.

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†



<details class='tool-usage-details'>

```json
{
  "id": "toolu_015bVRnh65zriHpQmHXfd4Bm",
  "call": {
    "function": "add_msg",
    "arguments": {
      "content": "**Section 2: From Pilot to Adoption**\n\nTo test whether GEPA could actually work, I created a synthetic dataset that mimicked a common problem in our domain: analyzing sales call transcripts. The task was to classify agent behaviors (things like \"introduction_rapport_building\" or \"objection_handling\") and predict whether the call outcome was good or bad. I had 27 labeled transcripts\u2014not a lot, but enough to see if the approach had legs.\n\nI expected weeks of iteration. Instead, I got meaningful results in a single run.\n\nThe optimizer ran for about 10 hours, cost roughly $2, and explored over 200<TRUNCATED>"
    }
  },
  "result": "'_4c93745d'"
}
```

</details>

Done. I've added an expanded Section 2 that elaborates on the key ideas (synthetic dataset setup, the 10hr/$2 run, watching prompts evolve, and getting buy-in with demos). Does the tone match Section 1, or would you like adjustments?

**Section 5: The Impact**
* Accuracy went 72% -> 81% -> 90%, in line with expectations, and the process was repeatable. All I need to do is run error analysis and provide targeted feedback.
* Shifted from manual prompt iteration (days/weeks) to automated optimization (hours). Significant reduction in prompt engineering effort per use case.

**Section 6: Lessons Learned & Recommended Workflow**
When not to use GEPA: when the target (y) variable is unclear, i.e. when you cannot label the cases clearly yourself as a human.
1. Start with a clear target metric and a dogfooded dataset (v1)
2. Run GEPA → review failing cases → add targeted feedback
3. Run GEPA again → stop when metrics are acceptable → deploy
4. Post-deployment: monitor for failures → add to dataset → iterate
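
The "review failing cases" step in the workflow above can be sketched as a small helper that gathers every example scoring below a threshold. This is an illustration under assumptions: `collect_failures`, the fake predictor, and the exact-match metric are stand-ins so the sketch runs without an LLM.

```python
def collect_failures(examples, predict, metric, threshold=1.0):
    """Return (example, prediction, score) for every example scoring below threshold."""
    failures = []
    for ex in examples:
        pred = predict(ex["message"])
        score = metric(ex, pred)
        if score < threshold:
            failures.append((ex, pred, score))
    return failures


# Made-up examples for illustration.
examples = [
    {"message": "agent rushed to close", "final_result": "bad"},
    {"message": "agent handled objections", "final_result": "good"},
]


def always_good(message):
    """Stand-in predictor that labels every call as good."""
    return "good"


def exact_match(ex, pred):
    """Stand-in metric: 1.0 when the predicted label matches the gold label."""
    return 1.0 if ex["final_result"] == pred else 0.0


failures = collect_failures(examples, always_good, exact_match)
print(len(failures), failures[0][0]["message"])
```

The collected failures are what a reviewer would go through to write targeted feedback before the next optimization run, and after deployment the same list is a candidate pool for growing the dataset.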

**Section 7: Production considerations**
1. DSPy's integration with MLflow is a straightforward option for monitoring and versioning (linking article).
2. When to add a failure case to the evaluation dataset is a business call.


[^1]: I considered few-shot learning first, but it doesn't scale: you're limited by context length, and the model still doesn't understand *why* examples work. I needed something that could rewrite the instructions themselves.

## **Section 3: The Architecture**

We are going to use DSPy to run GEPA.

> *If you're new to DSPy: it's a framework that treats prompts as code you can optimize programmatically, rather than strings you tweak by hand. For a deeper technical introduction, see [The Data Quarry's guide](https://thedataquarry.com/blog/learning-dspy-3-working-with-optimizers/).*

### Prerequisites for Optimization

Before running any optimizer, you need three things:

| Component | What it does | Why it matters |
|-----------|--------------|----------------|
| **DSPy Signature (The Prompt)** | Your baseline module defining the task | The "thing" being optimized |
| **Metric Function** | Returns score and textual feedback | Tells optimizer what "good" looks like |
| **Dataset** | Labeled examples (train/val/test) | Ground truth for evaluation |

The key insight: **optimizers need textual feedback, not just scores.** If you only return a number, the optimizer is flying blind. If you return *why* something failed, it can propose smarter mutations.

### The Setup

**Code Walkthrough**

I used DSPy's `ChainOfThought` module to define my task: given a call transcript, output (1) which behavioral categories the agent demonstrated, and (2) whether the call outcome was good or bad. My initial prompt was embarrassingly simple, just two lines describing what I wanted:

```python
from typing import List, Literal

import dspy


class CallAnalysis(dspy.Signature):
    """
    Read the provided call transcript and analyze it comprehensively.
    Determine both: (1) which categories the agent displayed, and 
    (2) whether the call will lead to conversion or customer retention.
    """
    message: str = dspy.InputField()
    categories: List[Literal["introduction_rapport_building", "need_assessment_qualification", 
                             "value_proposition_feature_mapping", "objection_handling", 
                             "benefit_reinforcement", "risk_reduction_trust_building", 
                             "call_to_action_closing"]] = dspy.OutputField()
    final_result: Literal['good', 'bad'] = dspy.OutputField()

program = dspy.ChainOfThought(CallAnalysis)
```

The model knew *what* to do but had no guidance on *how* to do it well.

### The Secret Sauce: Feedback, Not Just Scores

This is the key enabler for GEPA's reflective mutation: returning *textual feedback*, not just a score. The baseline, score-only metric looks like this:

```python
def call_qual_metric(gold, pred):
    return 1.0 if gold == pred else 0.0

def category_qual_metric(gold, pred):
    """Compute score for categories using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correct = len(gold_true & pred_set) + len(gold_false - pred_set)
    return correct / len(gold)

def comb_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Overall metric combining both scores."""
    call_qual = call_qual_metric(gold.final_result, pred.final_result)
    category_qual = category_qual_metric(gold.categories, pred.categories)
    return (call_qual + category_qual) / 2
```

The insight: with feedback attached, GEPA doesn't just know "this was wrong" when it sees a failure; it knows *why* and can propose smarter mutations.


```python
def call_qual_feedback(gold, pred):
    """ Generate feedback for final result module. """
    if gold == pred:
        fb = f"You correctly classified the sales call as `{gold}`. This sales call is indeed `{gold}`."
    else:
        fb = f"You incorrectly classified the sales call as `{pred}`. The correct sales call is `{gold}`. Think about how you could have reasoned to get the correct sales call label."
    return fb

def category_qual_feedback(gold, pred):
    """Generate feedback using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correctly_included = gold_true & pred_set
    incorrectly_included = gold_false & pred_set
    incorrectly_excluded = gold_true - pred_set
    correctly_excluded = gold_false - pred_set
    
    score = (len(correctly_included) + len(correctly_excluded)) / len(gold)
    
    if score == 1.0:
        # Return only the feedback string so both branches have the same return type
        return f"Perfect. Correctly identified: `{correctly_included}`."
    
    fb = f"Correctly identified: `{correctly_included}`.\n"
    if incorrectly_included:
        fb += f"False positives: `{incorrectly_included}`.\n"
    if incorrectly_excluded:
        fb += f"Missed: `{incorrectly_excluded}`.\n"
    return fb

def comb_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Compute the combined score and textual feedback for a call-analysis prediction.
    Returns a dspy.Prediction carrying both `score` and `feedback`.
    """
    # Compute feedback and scores
    cal_fb = call_qual_feedback(gold.final_result, pred.final_result)
    cat_fb = category_qual_feedback(gold.categories, pred.categories)
    fb = cal_fb + '\n' + cat_fb
    score = comb_metric(gold, pred)
    return dspy.Prediction(score=score, feedback=fb)
```

### How GEPA Works Under the Hood

Before running the optimizer, let's understand what GEPA actually does:

1. **Reflective Mutation**: Unlike random genetic mutation, GEPA's LLM *reads the failure feedback* and proposes targeted improvements. It's not guessing; it's reasoning about what went wrong.

2. **Pareto Selection**: Instead of keeping only the single best prompt, GEPA maintains a "frontier" of diverse specialists. One prompt might excel at detecting objection handling; another at predicting call outcomes. This prevents catastrophic forgetting.

3. **Text-as-Feedback**: Traditional RL uses scalar rewards. GEPA exploits rich textual feedback ("You incorrectly marked this as rapport-building because...") to guide mutations more precisely.
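
As a rough illustration of reflective mutation, the sketch below assembles a reflection prompt from the current instruction and the feedback on failed examples. This is not GEPA's actual template; `build_reflection_prompt` and its wording are hypothetical and only show the general shape of "instruction + feedback in, improved instruction out".

```python
def build_reflection_prompt(current_instruction: str, failures: list) -> str:
    """Combine the current instruction with feedback from failed examples."""
    parts = [
        "Current instruction:",
        current_instruction,
        "",
        "Feedback on failed examples:",
    ]
    for i, (transcript, feedback) in enumerate(failures, start=1):
        parts.append(f"{i}. Transcript: {transcript}")
        parts.append(f"   Feedback: {feedback}")
    parts += ["", "Propose an improved instruction that addresses this feedback."]
    return "\n".join(parts)


# Made-up failure for illustration.
failures = [
    ("Hi, this is Mark from Amex...", "False positives: introduction_rapport_building."),
]

prompt = build_reflection_prompt(
    "Read the provided call transcript and analyze it.", failures
)
print(prompt)
```

The text returned here would be sent to a reflection model, whose answer becomes a new candidate instruction to evaluate. With only a scalar score there would be nothing specific to put in the feedback lines.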

### Running GEPA

With prerequisites in place, the optimizer setup is straightforward:

```python
from dspy import GEPA

optimizer = GEPA(
    metric=comb_metric_with_feedback,  # Returns score + textual feedback for failures
    auto="light",                 # Budget setting (use "heavy" for production)
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)
```

### What I Observed

My 2-line prompt evolved into a ~1,500 word instruction set. The optimizer discovered things I'd never explicitly taught it: precise definitions for each category, rules for ambiguous cases ("a bare greeting isn't rapport-building; look for warmth and time acknowledgment"), and domain-specific guidance about soft pulls and secure application links.

The result: 72% → 81% accuracy. More importantly, the process was repeatable. I wasn't guessing anymore; I had a system.

## **The Task at a Glance**

Before diving in, here's what we're building: a system that analyzes sales call transcripts to detect agent behaviors and predict outcomes, automatically.

```
📞 Call Transcript → 🤖 LLM with Optimized Prompt → 📊 Categories + Outcome
```

The challenge: writing a prompt that consistently detects subtle patterns like "rapport building" vs "just a greeting." Manual tuning failed. Automated optimization succeeded.