## Article: How I Unlocked Tremendous Value by Automating Prompt Engineering at Enterprise Scale

## **The Task at a Glance**

Before diving in, here's what we're building: a system that automatically analyzes sales call transcripts to detect agent behaviors and predict outcomes.

```
📞 Call Transcript → 🤖 LLM with Optimized Prompt → 📊 Categories + Outcome
```

The challenge: writing a prompt that consistently detects subtle patterns like "rapport building" vs "just a greeting." Manual tuning failed. Automated optimization succeeded.

## **Section 1: The Problem**

I'd built plenty of successful POCs with LLM prompts, and they always looked impressive in demos. But when it came time to deploy at scale, everything fell apart. Manual prompt engineering is fundamentally non-deterministic: what works today might fail tomorrow, and there's no systematic way to improve it.

My challenge was unlocking value from years of unstructured customer interaction data. I'd spent three weeks fine-tuning a single prompt by hand, iterating through variations, hoping to stumble on something that worked consistently. It wasn't cutting it.

**Then it hit me:** LLMs are remarkably good at *giving feedback*. They can look at their own outputs and tell you exactly what went wrong. So why couldn't they convert that feedback into better prompts automatically? Something should exist already, right?

I started reading everything I could find on the topic. That's when I stumbled upon DSPy and the concept of prompt optimizers: systems that treat prompt engineering as an optimization problem rather than an art. It clicked immediately. When GEPA (Genetic-Pareto) was released, I knew I had to test it.[^1]

Think of it like compilation: you write high-level code (your task definition), and the compiler transforms it into optimized machine instructions (a battle-tested prompt). You don't hand-tune assembly, so why hand-tune prompts?

## **Section 2: The Pilot**

I needed a simple test to see whether this could work for my use case, so I created a synthetic dataset of 27 sales call transcripts representing a real challenge we face: detecting the presence of required behaviors and predicting call quality (good/bad). The transcripts were hand-labeled across 7 behavior categories (introduction/rapport building, needs assessment, value proposition mapping, objection handling, benefit reinforcement, risk reduction, and call-to-action closing). Small enough to iterate fast, realistic enough to validate the approach.



In [None]:
%%html
<div style="display:flex; gap:20px; font-family:system-ui,-apple-system,sans-serif;">
  <div style="flex:1; background:#f8f9fa; border-radius:12px; padding:20px; border:1px solid #e0e0e0;">
    <h3 style="margin:0 0 12px 0; color:#1a73e8; font-size:14px; text-transform:uppercase; letter-spacing:1px;">📞 Input: Call Transcript</h3>
    <div style="background:white; padding:16px; border-radius:8px; font-size:13px; line-height:1.6; max-height:300px; overflow-y:auto; white-space:pre-wrap; color:#333;">agent: Hi, good afternoon, this is Maya calling from American Express. Am I speaking with Jordan Lee?

customer: Yes, this is Jordan.

agent: Great, Jordan. How are you doing today?

customer: I'm good, thanks. Busy afternoon, but I have a few minutes.

agent: I appreciate you taking the time...</div>
  </div>
  <div style="flex:1; background:#f0f7f0; border-radius:12px; padding:20px; border:1px solid #c8e6c9;">
    <h3 style="margin:0 0 12px 0; color:#2e7d32; font-size:14px; text-transform:uppercase; letter-spacing:1px;">📊 Output: Analysis</h3>
    <div style="background:white; padding:16px; border-radius:8px; margin-bottom:12px;">
      <div style="font-size:12px; color:#666; margin-bottom:4px;">Call Quality</div>
      <div style="font-size:24px; font-weight:600; color:#2e7d32;">✓ Good</div>
    </div>
    <div style="background:white; padding:16px; border-radius:8px;">
      <div style="font-size:12px; color:#666; margin-bottom:8px;">Detected Categories</div>
      <div style="display:flex; flex-wrap:wrap; gap:6px;">
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Introduction/Rapport ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Need Assessment ✓</span>
        <span style="background:#ffebee; color:#c62828; padding:4px 10px; border-radius:16px; font-size:12px;">Value Proposition ✗</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Objection Handling ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Benefit Reinforcement ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Risk Reduction ✓</span>
        <span style="background:#e8f5e9; color:#2e7d32; padding:4px 10px; border-radius:16px; font-size:12px;">Call to Action ✓</span>
      </div>
    </div>
  </div>
</div>
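
In code, one labeled example can be stored as a small record. The sketch below mirrors the illustration above. The field names (`message`, `answer`, `final_result`) follow the signature and metric code shown later in this article, and the category names follow the optimized prompt; the `make_example` helper and the sample transcript are hypothetical.

```python
import json

# Category names as they appear in the optimized prompt later in this article.
CATEGORIES = [
    "introduction_rapport_building",
    "need_assessment_qualification",
    "value_proposition_feature_mapping",
    "objection_handling",
    "benefit_reinforcement",
    "risk_reduction_trust_building",
    "call_to_action_closing",
]


def make_example(transcript, present, final_result):
    """Build one labeled record (hypothetical helper).

    `present` is the set of categories the agent displayed. Every other
    category is stored as False so each label is explicit.
    """
    unknown = set(present) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    if final_result not in ("good", "bad"):
        raise ValueError("final_result must be 'good' or 'bad'")
    labels = {name: name in present for name in CATEGORIES}
    return {
        "message": transcript,
        "answer": json.dumps(labels),  # JSON string, decoded again by the metric
        "final_result": final_result,
    }


# The call from the illustration: every behavior except value proposition.
example = make_example(
    "agent: Hi, good afternoon ...",
    present=set(CATEGORIES) - {"value_proposition_feature_mapping"},
    final_result="good",
)
print(json.loads(example["answer"]))
```

Storing an explicit true/false per category, rather than only the positives, lets a metric count both false positives and missed categories.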

I expected weeks of iteration. Instead, I got meaningful results in a single run.

### Results

| Approach | Cost | Time | Accuracy |
|----------|------|------|----------|
| Manual prompt engineering | $100-1000 (engineer time) | Days to weeks | 72% |
| **GEPA optimization** | ~$2 | 10 hours | 81% |
| GEPA optimization with some basic error analysis | ~$0.5 | 3 hours | 90% |

The optimizer ran for about 10 hours, cost roughly $2, and explored over 200 prompt variants. Through genetic mutation and Pareto selection, it whittled those down to 9 "survivors": prompts that excelled at different subsets of the problem. The best performer jumped from 72% to 81% accuracy, a lift I hadn't achieved in months of manual tuning.

What really convinced me wasn't just the final number; it was watching the intermediate prompts evolve. I could see the optimizer discovering nuances I'd never thought to include: explicit definitions for each category, step-by-step rules for edge cases, domain-specific guidance about soft pulls versus hard pulls. The quality of the reasoning it produced while iterating was genuinely impressive.

Getting buy-in was straightforward after that. I showed a side-by-side demo: baseline prompt versus optimized prompt, same test cases, measurable improvement. Numbers speak louder than theory.

## **Section 3: The Architecture**

We'll use DSPy to run GEPA. If you're new to DSPy, it's a framework that treats prompts as code you can optimize programmatically. For background, see [The Data Quarry's guide](https://thedataquarry.com/blog/learning-dspy-3-working-with-optimizers/).

---

### How GEPA Works

GEPA (Genetic-Pareto) differs from traditional optimization in three key ways:

1. **Reflective Mutation**: The LLM *reads failure feedback* and proposes targeted improvements. It's not random guessing; it's reasoning about what went wrong.

2. **Pareto Selection**: Instead of keeping only the single best prompt, GEPA maintains a "frontier" of diverse specialists. One prompt might excel at detecting objection handling; another at predicting outcomes. This prevents catastrophic forgetting.

3. **Text-as-Feedback**: Traditional RL uses scalar rewards. GEPA exploits rich textual feedback ("You incorrectly marked this as rapport-building because...") to guide mutations precisely.
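
To make the Pareto idea concrete, here is a minimal sketch in plain Python. It keeps every candidate that is best (or tied for best) on at least one example, which is a simplification of the idea and not GEPA's actual implementation; the function name, prompt names, and scores are all hypothetical.

```python
def pareto_frontier(scores):
    """Return candidates that are (tied for) best on at least one example.

    `scores` maps a candidate name to its list of per-example scores.
    All lists are assumed to have the same length.
    """
    if not scores:
        return []
    names = list(scores)
    n_examples = len(scores[names[0]])
    frontier = set()
    for i in range(n_examples):
        best = max(scores[name][i] for name in names)
        frontier.update(name for name in names if scores[name][i] == best)
    return sorted(frontier)


# Hypothetical per-example scores for four prompt variants.
scores = {
    "prompt_a": [1.0, 0.2, 0.6],  # best on example 0
    "prompt_b": [0.4, 1.0, 0.7],  # best on example 1, and best on average
    "prompt_c": [0.5, 0.5, 0.9],  # best on example 2
    "prompt_d": [0.3, 0.1, 0.5],  # never best, so it is dropped
}
print(pareto_frontier(scores))
```

Selecting by average alone would keep only `prompt_b`. The frontier also keeps `prompt_a` and `prompt_c`, each of which handles one example better than any other candidate.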

---

### Prerequisites

| Component | What it does | Why it matters |
|-----------|--------------|----------------|
| **DSPy Signature** | Your baseline module defining the task | The "thing" being optimized |
| **Metric Function** | Returns score + textual feedback | Tells optimizer what "good" looks like *and why* |
| **Dataset** | Labeled examples (train/val/test) | Ground truth for evaluation |
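
For the dataset row, a reproducible train/val/test split can be sketched as follows. The 60/20/20 proportions, the seed, and the `split_dataset` helper are assumptions for illustration; the split actually used in this project is not stated.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, val, test


# Placeholder IDs standing in for the 27 labeled transcripts.
examples = [f"call_{i:02d}" for i in range(27)]
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))
```

A fixed seed keeps the split stable across optimization runs, so score changes reflect the prompt and not a reshuffled test set.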

---

### The DSPy Setup

My initial prompt was embarrassingly simple, just two lines:

```python
from typing import List, Literal

import dspy


class CallAnalysis(dspy.Signature):
    """
    Read the provided call transcript and analyze it comprehensively.
    Determine both: (1) which categories the agent displayed, and
    (2) whether the call will lead to conversion or customer retention.
    """
    message: str = dspy.InputField()
    # The remaining category names are elided here; the full list has seven entries.
    categories: List[Literal["introduction_rapport_building", ...]] = dspy.OutputField()
    final_result: Literal['good', 'bad'] = dspy.OutputField()

# ChainOfThought adds a reasoning step before the output fields are produced.
program = dspy.ChainOfThought(CallAnalysis)
```

The model knew *what* to do but had no guidance on *how*.

---

### Adding Feedback

This is the key enabler. A basic metric returns only a score: the optimizer knows "0.7" but not *why* it failed. With feedback, the optimizer can reason about failures and propose targeted fixes.
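
A self-contained sketch of such a metric, in plain Python so it runs without DSPy or an LLM. The function name `score_with_feedback` and the sample data are hypothetical; with DSPy, the returned dictionary would instead be wrapped as `dspy.Prediction(score=..., feedback=...)`.

```python
import json
from types import SimpleNamespace


def score_with_feedback(example, pred):
    """Score one prediction and describe every mistake in plain text."""
    gold_cat = json.loads(example["answer"])  # {category: bool}
    gold_final = example["final_result"]
    predicted = set(pred.categories)

    # Predicted but not labeled, and labeled but not predicted.
    false_positives = sorted(k for k, v in gold_cat.items() if not v and k in predicted)
    missed = sorted(k for k, v in gold_cat.items() if v and k not in predicted)

    cat_score = 1 - (len(false_positives) + len(missed)) / len(gold_cat)
    final_score = 1.0 if pred.final_result == gold_final else 0.0
    score = (cat_score + final_score) / 2

    if pred.final_result == gold_final:
        lines = [f"Correct: {gold_final}"]
    else:
        lines = [f"Incorrect: predicted {pred.final_result}, actual {gold_final}"]
    if false_positives:
        lines.append(f"False positives: {false_positives}")
    if missed:
        lines.append(f"Missed categories: {missed}")
    return {"score": score, "feedback": "\n".join(lines)}


# Hypothetical labeled example and prediction.
example = {
    "answer": json.dumps({
        "introduction_rapport_building": False,
        "need_assessment_qualification": True,
        "objection_handling": True,
        "call_to_action_closing": True,
    }),
    "final_result": "bad",
}
pred = SimpleNamespace(
    categories=["introduction_rapport_building",
                "need_assessment_qualification",
                "call_to_action_closing"],
    final_result="good",
)
result = score_with_feedback(example, pred)
print(result["score"])
print(result["feedback"])
```

The score alone (0.25 here) says the prediction was poor. The feedback text says which category was over-detected, which was missed, and that the outcome was wrong, which is what the reflection step can act on.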

In [None]:
from IPython.display import HTML

metric_code = """def comb_metric(example, pred):
    gold_cat = json.loads(example['answer'])
    gold_final = example['final_result']
    
    # Category score
    correct = sum(1 for k, v in gold_cat.items() 
                  if (v and k in pred.categories) or 
                     (not v and k not in pred.categories))
    cat_score = correct / len(gold_cat)
    
    # Final result score
    final_score = 1.0 if gold_final == pred.final_result else 0.0
    
    return (cat_score + final_score) / 2"""

metric_fb_code = """def comb_metric_with_feedback(example, pred, trace=None, pred_name=None, pred_trace=None):
    # ... same scoring logic as above; incorrectly_included and
    # incorrectly_excluded are the false-positive and missed categories ...
    
    # Generate textual feedback
    if gold_final != pred.final_result:
        fb = f"Incorrect: predicted {pred.final_result}, actual {gold_final}"
    else:
        fb = f"Correct: {gold_final}"
    
    if incorrectly_included:
        fb += f"\\nFalse positives: {incorrectly_included}"
    if incorrectly_excluded:
        fb += f"\\nMissed categories: {incorrectly_excluded}"
    
    return dspy.Prediction(score=score, feedback=fb)"""

html = f"""
<style>
  .compare-container {{ display: flex; gap: 20px; flex-wrap: wrap; }}
  .compare-box {{ flex: 1; min-width: 280px; border: 2px solid #ccc; padding: 15px; border-radius: 8px; }}
  .compare-box.basic {{ background: #f9f9f9; border-color: #ccc; }}
  .compare-box.feedback {{ background: #f0fff0; border-color: #27ae60; }}
  .compare-box h3 {{ margin-top: 0; }}
  .compare-box pre {{ white-space: pre-wrap; font-size: 11px; overflow-x: auto; }}
  @media (max-width: 600px) {{
    .compare-container {{ flex-direction: column; }}
    .compare-box {{ min-width: 100%; }}
  }}
</style>
<div class="compare-container">
  <div class="compare-box basic">
    <h3 style="color: #d35400;">❌ Metric (Score Only)</h3>
    <pre>{metric_code}</pre>
    <br>
    <p><b>Problem:</b> Optimizer only knows "0.7", with no idea <i>why</i> it failed.</p>
  </div>
  <div class="compare-box feedback">
    <h3 style="color: #27ae60;">✅ Metric with Feedback</h3>
    <pre>{metric_fb_code}</pre>
    <br>
    <p><b>Benefit:</b> Optimizer sees "Missed: intro_rapport" → can propose targeted fix.</p>
  </div>
</div>
"""
HTML(html)

---

### Running GEPA

With prerequisites in place, optimization is straightforward:

```python
import dspy
from dspy import GEPA

# GEPA needs a reflection LM that reads the feedback and proposes new instructions.
# Placeholder model name: substitute the model you actually use.
reflection_lm = dspy.LM("openai/gpt-4o")

optimizer = GEPA(
    metric=comb_metric_with_feedback,
    auto="light",
    reflection_lm=reflection_lm,
)

optimized_program = optimizer.compile(program, trainset=train_set, valset=val_set)
```

---

### What I Observed

My 2-line prompt evolved into nearly 6,000 characters of instruction:

In [None]:
initial_prompt = """Read the provided call transcript and analyze it comprehensively.
Determine both: (1) which categories the agent displayed, and (2) whether the call will lead to conversion or customer retention."""

final_prompt = """New Instructions for Analyzing Banking/Card Transaction Call Transcripts

Overview
You are an analysis assistant whose job is to evaluate sales/transaction-focused call transcripts in the banking/credit-card domain. For each transcript, produce a compact, structured analysis with two main objectives:
  (a) identify the agent behavior categories demonstrated (from the seven pillars below), and
  (b) judge the likely business outcome of the call (conversion, retention, or mixed).

Inputs you will receive
- A complete transcript of a single call between an agent and a customer. Transcripts may include labels such as "agent:" and "customer:" and may cover topics like card offers, fees, rewards, security, and next steps.

What you must produce (three sections exactly)
1) reasoning
   - Provide a concise, bullets-style justification for every pillar category you detected in the transcript.
   - Include short quotes or paraphrases from the transcript to illustrate why the category applies. Do not introduce facts or assumptions beyond what is in the transcript.
   - If you detect a strength/weakness signal about the outcome, include a brief, one- to two-sentence note here describing how strong the signal is and what would push it toward conversion or toward retention.
   - This section may contain a small, optional note about outcome strength, but must not introduce information outside the transcript.

2) categories
   - Output a Python-like list of the detected pillar categories in the exact order they first appeared in the transcript.
   - Example format: ['introduction_rapport_building', 'need_assessment_qualification', ...]

3) final_result
   - A single word indicating the likely business outcome:
     - conversion - the call is progressing toward an immediate or near-term application/upgrade/activation.
     - retention - the call focuses on keeping an existing customer, avoiding churn, or upselling within retention without an immediate conversion.
     - mixed - signals of both retention and conversion, or the outcome is uncertain and depends on future steps.
   - Do not add any qualifiers in this field; use exactly one of the three keywords above.

Optional but encouraged: assess the strength of the outcome
- If you include it (recommendation), place this assessment only in reasoning as the optional strength_of_outcome note. Keep it concise (one or two sentences). It should address:
  - How strong is the conversion/retention signal?
  - What would most likely push the outcome toward conversion, or toward retention?

Pillar definitions (seven bank/card-specific categories)
- introduction_rapport_building
  - Includes opening greetings, courtesy, acknowledgment of time, and attempts to establish rapport.
  - Examples: greetings, confirming time, polite introductions, small talk about fit or time constraints.
- need_assessment_qualification
  - Involves asking about customer needs, usage, spend patterns, eligibility checks, and whether the product fits (e.g., business vs personal, employee cards, annual fees, soft vs hard pulls).
- value_proposition_feature_mapping
  - Linking card features to tangible, customer-relevant benefits (rewards, protections, credits) and showing how those features align with stated needs.
- objection_handling
  - Addressing concerns about price, complexity, trust, enrollment, or process obstacles. Includes acknowledging concerns and offering clarifications or mitigations.
- benefit_reinforcement
  - Reiterating concrete benefits and value after objections or hesitations, often tying back to the customer's stated needs.
- risk_reduction_trust_building
  - Providing security assurances, privacy protections, non-hard-pull options, guarantees, terms clarity, or brand trust signals.
- call_to_action_closing
  - Concrete next steps or commitments: soft checks, secure links, email/mail options, scheduling follow-ups, or instructions to apply/get more information.

How to apply the rules
- For every transcript, read from start to finish. Mark each pillar as soon as its criteria are clearly demonstrated.
- If a single utterance clearly satisfies more than one pillar, count it under all applicable pillars.
- If a pillar is not clearly demonstrated anywhere in the transcript, do not include it in the categories list.
- Record the detected pillars in the exact order of their first appearance in the transcript.
- The final_result should reflect the overall trajectory of the call as described above.

Output constraints and format
- Do not introduce any facts not present in the transcript.
- Do not insert subjective opinions beyond what is grounded in the transcript.
- Use the exact section headings and formatting:
  reasoning
  categories
  final_result
- Do not include extraneous content beyond the three sections above.

Domain-specific considerations
- You may encounter references to soft pulls vs hard pulls, online applications, secure links, email follow-ups, or scheduled follow-ups. Treat these as legitimate "call_to_action_closing" or "risk_reduction_trust_building" elements as appropriate.
- When quoting or paraphrasing, keep quotes brief and focused on the reason for the pillar.
- If PII appears in the transcript (e.g., partial SSN or addresses), quote minimally and do not reveal full sensitive data in your justification. You may paraphrase or reference the presence of sensitive data without reproducing it.

Example behavior (not to reproduce here)
- A transcript with strong, explicit next steps to apply or pre-qualify is more conversion-oriented; a transcript focusing solely on questions and reassurance without a clear next step is more retention-oriented or mixed.

End result
- Return exactly three sections for every transcript analyzed, with the content governed by the rules above. This format enables consistent, comparable, and transparent analysis across transcripts."""

print(f"Initial prompt length: {len(initial_prompt)} chars")
print(f"Final prompt length: {len(final_prompt)} chars")

Initial prompt length: 195 chars
Final prompt length: 5902 chars


In [None]:
from IPython.display import HTML

initial_prompt = """Read the provided call transcript and analyze it comprehensively.
Determine both: (1) which categories the agent displayed, and (2) whether the call will lead to conversion or customer retention."""

html = f"""
<div style="display: flex; flex-wrap: wrap; gap: 20px;">
  <div style="flex: 1 1 300px; min-width: 280px; border: 2px solid #ccc; padding: 15px; border-radius: 8px; background: #f9f9f9;">
    <h3 style="color: #d35400; margin-top: 0;">Initial Prompt</h3>
    <pre style="white-space: pre-wrap; font-size: 12px;">{initial_prompt}</pre>
  </div>
  <div style="flex: 1 1 300px; min-width: 280px; border: 2px solid #27ae60; padding: 15px; border-radius: 8px; background: #f0fff0;">
    <h3 style="color: #27ae60; margin-top: 0;">Optimized Prompt (GEPA)</h3>
    <pre style="white-space: pre-wrap; font-size: 11px; max-height: 400px; overflow-y: auto;">{final_prompt}</pre>
  </div>
</div>
<p style="margin-top: 15px;"><b>Score improvement:</b> 72.1% → 81.4%</p>
"""
HTML(html)

The optimizer discovered nuances I'd never thought to include, like "a bare greeting isn't rapport-building; look for warmth and time acknowledgment."

Result: **72% → 81% accuracy**. More importantly, the process was repeatable.
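
A before/after comparison like this only means something when both programs are scored on the same held-out examples with the same metric. Here is a minimal sketch of that check in plain Python. The `evaluate` helper, the stand-in programs, and the toy test set are all hypothetical, and the numbers it prints come from the toy data, not from the experiment. DSPy also ships its own evaluation utility, `dspy.Evaluate`.

```python
from types import SimpleNamespace


def evaluate(program, dataset, metric):
    """Mean metric score of `program` over `dataset` (higher is better)."""
    if not dataset:
        raise ValueError("dataset is empty")
    total = sum(metric(ex, program(ex["message"])) for ex in dataset)
    return total / len(dataset)


def final_result_match(example, pred):
    """1.0 when the predicted call quality matches the label, else 0.0."""
    return 1.0 if pred.final_result == example["final_result"] else 0.0


# Stand-ins for the real programs: plain callables, no LLM involved.
def baseline(message):
    return SimpleNamespace(final_result="good")


def optimized(message):
    hesitant = "not sure" in message
    return SimpleNamespace(final_result="bad" if hesitant else "good")


test_set = [
    {"message": "customer: sounds great, send the link", "final_result": "good"},
    {"message": "customer: I'm really not sure...", "final_result": "bad"},
    {"message": "customer: yes, let's do the soft check", "final_result": "good"},
    {"message": "customer: not sure this fits us", "final_result": "bad"},
]
print("baseline: ", evaluate(baseline, test_set, final_result_match))
print("optimized:", evaluate(optimized, test_set, final_result_match))
```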

For a deeper technical dive: [GEPA Deepdive](https://risheekkumar.in/posts/gepa-deepdive/gepa_final_article.html)

## **Section 4: Using Error Analysis on Top of GEPA**

The results were promising: I used the 81% prompt internally, and it worked wherever it was implemented properly. The next challenge was pushing accuracy further. The cases it still failed on were not that difficult, and my hunch was that the LLM could get them right if given a hint. The problem was how to steer the automated optimizer toward a prompt that handles those failing cases.

| Input (truncated) | Actual | Predicted | Targeted feedback |
|---|---|---|---|
| Tyler calling from Amex... "sounds good, I'll send that link..." | bad | good | The customer showed hesitation ("I'm really not sure...") and the agent rushed to close without addressing concerns. A "bad" call lacks proper objection handling before closing. |
| Mark from Amex calling about credit solutions... | bad (intro_rapport=false) | bad (intro_rapport=true) | The agent said "Hi, this is Mark from Amex" without warmth, time acknowledgment, or rapport-building. A bare introduction doesn't qualify as introduction_rapport_building. |
| John from American Express... | bad (intro_rapport=false) | bad (intro_rapport=true) | Similar: the agent jumped straight into the pitch. Greeting alone isn't rapport-building; look for courtesy, time check, or warmth signals. |

The pattern: the model over-detects `introduction_rapport_building` (any greeting = rapport) and sometimes misses when a call is actually "bad" despite having next steps.

I built a simple system in Excel that generates targeted feedback for the failing cases. With that feedback in place, accuracy improved from 81% to 90%.
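
A minimal sketch of one way such targeted notes could be merged into the metric's feedback. It assumes the sheet is exported as CSV with two hypothetical columns, `example_id` and `targeted_feedback`. The helper names, IDs, and note wording are illustrative, not the setup actually used.

```python
import csv
import io


def load_targeted_feedback(csv_text):
    """Read `example_id,targeted_feedback` rows into a lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["example_id"]: row["targeted_feedback"] for row in reader}


def enrich_feedback(example_id, generic_feedback, targeted):
    """Append the reviewer's note when this example has one."""
    note = targeted.get(example_id)
    if not note:
        return generic_feedback
    return f"{generic_feedback}\nReviewer note: {note}"


# Stand-in for a sheet exported from Excel; IDs and wording are illustrative.
sheet = """example_id,targeted_feedback
call_07,"A bare introduction is not rapport building; look for warmth or a time check."
call_12,"The agent rushed to close without handling the customer's hesitation."
"""
targeted = load_targeted_feedback(sheet)

print(enrich_feedback("call_07", "False positives: ['introduction_rapport_building']", targeted))
print(enrich_feedback("call_03", "Correct: good", targeted))
```

Examples without a note keep their generic feedback, so the sheet only needs rows for the cases that were actually reviewed.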

## **Section 5: The Impact**

* Accuracy climbed from 72% to 81% to 90%, and the process was repeatable: all I need to do is error analysis followed by proper feedback.
* The work shifted from manual prompt iteration (days to weeks) to automated optimization (hours), a significant reduction in prompt engineering effort per use case.

## **Section 6: Lessons Learned & Recommended Workflow**

When not to use GEPA: when the target variable is unclear, meaning you cannot do the task consistently yourself as a human.

1. Start with a clear target metric and a dogfooded dataset (v1).
2. Run GEPA → review failing cases → add targeted feedback.
3. Run GEPA again → stop when metrics are acceptable → deploy.
4. Post-deployment: monitor for failures → add to dataset → iterate.
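
Steps 2 and 3 form a loop, which can be sketched as below. The three callables are hypothetical hooks standing in for a GEPA run, an evaluation pass, and the manual error-analysis step; the toy stand-ins exist only so the loop runs without an LLM.

```python
def optimize_until_good_enough(run_optimizer, evaluate, review_failures,
                               target=0.9, max_rounds=5):
    """Alternate optimization and error analysis until the target is met.

    Hypothetical hooks:
      run_optimizer(feedback_notes) -> program
      evaluate(program) -> (score, failing_cases)
      review_failures(failing_cases) -> dict of new feedback notes
    """
    feedback_notes = {}
    history = []
    program = None
    for _ in range(max_rounds):
        program = run_optimizer(feedback_notes)
        score, failures = evaluate(program)
        history.append(score)
        if score >= target or not failures:
            break  # good enough, or nothing left to analyze
        feedback_notes.update(review_failures(failures))
    return program, history


# Toy stand-ins: the score rises as more feedback notes are available.
def fake_optimizer(notes):
    return {"notes_used": len(notes)}


def fake_evaluate(program):
    score = min(1.0, 0.5 + 0.25 * program["notes_used"])
    failures = [] if score >= 0.9 else [f"case_{program['notes_used']}"]
    return score, failures


def fake_review(failures):
    return {case: "targeted note" for case in failures}


program, history = optimize_until_good_enough(fake_optimizer, fake_evaluate, fake_review)
print(history)
```

The `max_rounds` cap is a design choice: it keeps cost bounded when the target is out of reach with the current dataset.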

## **Section 7: Production Considerations**

1. DSPy's integration with MLflow is one straightforward option for monitoring and versioning.
2. When to add a failure case to the evaluation dataset is a business call.


[^1]: I considered few-shot learning first, but it doesn't scale: you're limited by context length, and the model still doesn't understand *why* examples work. I needed something that could rewrite the instructions themselves.