# Optimize a prompt that maximizes win rate

Given
- a certified judge classifier `JUDGE_CLASSIFICATION_PROMPT: str`
- some input string data samples `INPUT_DATA_SAMPLES: list[str]`
- initial prompt `INITIAL_GENERATION_PROMPT: str`

Produce
- a prompt with a good win rate (measured by the judge) against other prompts

## Example use case presented in this notebook

Given a Quora answer, write followup questions to the answer.

Great followup questions should be appealing to respond to, and their answers should be appealing to read.

## Foreword

Think of how the project manager interacts with the prompt engineer.

The project manager does not need to provide examples of good output, just the criteria of what good output should be.

Inspirations

- Anthropic workbench https://console.anthropic.com/workbench/
- Cohere prompt tuner https://cohere.com/blog/intro-prompt-tuner

## Frequently asked questions

Why don't you use the judge prompt in the generation prompt?
- You can if you want. I expect future models to find the best response by iteratively brainstorming candidate responses and judging which one is best. The idea of prompt engineering is to get the model to learn the shortcut, since output tokens are not free.

Why don't you tune the `JUDGE_CLASSIFICATION_PROMPT`?
- In this tool we assume that we trust `JUDGE_CLASSIFICATION_PROMPT`. You should tune this elsewhere.

In [1]:
from functools import cache

import anthropic
client = anthropic.Anthropic()

# Inputs for the tool user

In [2]:
import pandas as pd
df = pd.read_csv("source_questions_and_answers.csv")
df.head(2)

Unnamed: 0,source_question_and_answer
0,How would you rate the success of the Olympic ...
1,"Who is more of a pure talent, Lebron or Curry?..."


In [3]:
INPUT_DATA_SAMPLES: list[str] = list(df["source_question_and_answer"])
source_questions_and_answers = INPUT_DATA_SAMPLES

In [4]:
INITIAL_GENERATION_PROMPT = """
You are given the following source question and a source answer.

<source_question_and_answer>
{source_question_and_answer}
</source_question_and_answer>

Write between 3 to 5 followup questions to the answer.

Return each question between <question> and </question>
""".strip()

In [5]:
JUDGE_CLASSIFICATION_PROMPT = """
You are given an answer and two sets of followup questions.

Determine which set of followup questions is better.

The better set of followup questions should have these characteristics
- Each question could be understood without context
- Each question should be written concisely
- Each question should appear between <question> and </question>
- There should be between 3 to 5 questions.
- Each question should be distinct.

This is the source question and source answer
<source_question_and_answer>
{source_question_and_answer}
</source_question_and_answer>

This is the first set of followup questions

{followup_questions_one}

This is the second set of followup questions

{followup_questions_two}

Write some concise reasoning, then end your response with one of the following
- Set <label>one</label> is better.
- Set <label>two</label> is better.
- Both sets <label>tie</label>.
""".strip()

# Judge classification prompt

In [6]:
# We assume that we trust this prompt
judge_classification_prompt = JUDGE_CLASSIFICATION_PROMPT

def judge_classification(
    source_question_and_answer: str,
    followup_questions_one: str,
    followup_questions_two: str,
) -> tuple[str, str]:
    # Returns (full judge response, label), where label is "one", "two" or "tie"
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": judge_classification_prompt.format(
                    source_question_and_answer=source_question_and_answer,
                    followup_questions_one=followup_questions_one,
                    followup_questions_two=followup_questions_two,
                )
            },
        ]
    )
    message_text = message.content[0].text
    if "<label>one</label>" in message_text:
        return message_text, "one"
    if "<label>two</label>" in message_text:
        return message_text, "two"
    return message_text, "tie"
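
The judge is asked to reason first and end with a label, so its reasoning could mention one label before settling on another. Below is a minimal sketch of a stricter parser that takes the last label in the response. `parse_label` and `LABEL_PATTERN` are hypothetical names, not part of the notebook.

```python
import re

# Matches any of the three labels the judge prompt asks for.
LABEL_PATTERN = re.compile(r"<label>(one|two|tie)</label>")


def parse_label(message_text: str) -> str:
    """Return the last label in the judge response, or "tie" if there is none."""
    labels = LABEL_PATTERN.findall(message_text)
    return labels[-1] if labels else "tie"
```

Falling back to `"tie"` when no label is found mirrors the default in `judge_classification`.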

In [7]:
source_question_and_answer = """
Who is more of a pure talent, Lebron or Curry?

I saw a recent interview with Team USA, where basketball players shared when they first dunked. Some guys dunked at 11 or 12. A few were “late bloomers” who didn’t dunk until 17. Steph didn’t dunk until he was 19 years old.

This kid’s never survived on talent. Never. He’s 6′3″ with shoes, four inches shorter than the NBA average. He’s got a solid vertical (35.5″) but would never have made it to the NBA if not for his shooting ability. His height plus his less “glamorous” college career made him a risky first-round pick. The Warriors grabbed him with the 7th pick. They had no idea

Compare that to LeBron.

He was a 6′7″ 16-year-old and unstoppable in high school. His numbers were unreal—not just the scoring, either, but the rebounding and passing, too. LeBron clocked a vertical north of 40 inches, more than enough to dunk on anybody in the league. The King didn’t need to really workshop his shooting until several years into his career. That’s talent.

I mean no disrespect to either player. Steph’s still very talented and LeBron still put a LOT of work in. But if they both won the genetic lottery, Steph won $10 million and LeBron won $100 million. Guys like Steph happen. They just rarely have the same shooting foundation (thanks to an NBA father) and work ethic to pull off his career. Guys like LeBron are much rarer. But neither could have survived with just talent or skill. That’s no longer possible in the league.
""".strip()

followup_questions_one = """
<question>How do basketball players with high vertical leaps compare to those with less-than-average verticals?</question>

<question>Is work ethic more important than natural talent in basketball?</question>

<question>What are some examples of athletes who overcame physical limitations to achieve success?</question>

<question>How important is it for basketball players to have a solid foundation in shooting?</question>

<question>How has the role of talent and skill evolved in the NBA over time?</question>
"""

followup_questions_two = """
1. <question>How has the balance between natural talent and skill development evolved in the NBA over the past few decades, and are there other examples that illustrate this shift?</question>

2. <question>You mentioned Curry's NBA father providing a shooting foundation - can you elaborate on other successful NBA players who had to overcome physical limitations through specialized skills passed down from family members?</question>

3. <question>In your genetic lottery analogy, where would you place other current NBA superstars on the spectrum between Curry ($10M) and LeBron ($100M), and why?</question>

4. <question>Given the contrast between early and late physical developers in basketball, what other "late bloomers" besides Curry have gone on to achieve NBA success despite not being early athletic standouts?</question>
"""

In [8]:
justification, judgement = judge_classification(
    source_question_and_answer,
    followup_questions_one,
    followup_questions_two,
)

In [9]:
judgement

'one'

In [10]:
print(justification)

Let me analyze both sets:

Set 1:
- Questions are concise and clear
- Each can stand alone without context
- Covers broad but related topics
- Has exactly 5 questions
- Easy to understand and answer

Set 2:
- Questions are lengthy and compound
- Requires context from the original answer
- Some questions are too specific (e.g., the genetic lottery analogy)
- Has 4 questions
- Multiple questions within single questions

While both sets explore interesting angles, Set 1 is superior because:
1. It maintains simplicity while covering important topics
2. Questions are independent and don't require original context
3. Each question focuses on one specific aspect
4. The format is clean and consistent
5. Follows all the given criteria perfectly

Set <label>one</label> is better.


# Generation prompt

In [11]:
initial_generation_prompt = INITIAL_GENERATION_PROMPT

In [12]:
def generation(
    generation_prompt: str,
    source_question_and_answer: str,
) -> str:
    try:
        generation_prompt_with_inputs = generation_prompt.format(
            source_question_and_answer=source_question_and_answer,
        )
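    except (KeyError, IndexError, ValueError):
        # str.format raises these for unknown placeholders or stray braces in the prompt
        return "{source_question_and_answer} should appear in the prompt"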
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": generation_prompt_with_inputs
            },
        ]
    )
    message_text = message.content[0].text
    return message_text
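
Note that `str.format` silently ignores unused keyword arguments, so a prompt that lacks `{source_question_and_answer}` does not raise and is sent to the model without the source text. Below is a minimal sketch of an explicit check, assuming the prompt should contain that placeholder and no other. The helper names are hypothetical.

```python
from string import Formatter

REQUIRED_FIELD = "source_question_and_answer"


def placeholder_fields(prompt: str) -> set[str]:
    """Collect the replacement-field names that str.format would look up."""
    return {field for _, field, _, _ in Formatter().parse(prompt) if field is not None}


def has_only_required_placeholder(prompt: str) -> bool:
    """True if the prompt parses cleanly and uses exactly the required placeholder."""
    try:
        return placeholder_fields(prompt) == {REQUIRED_FIELD}
    except ValueError:
        # Unbalanced braces such as a lone "{" make the prompt unparseable.
        return False
```

Running this check on an optimized prompt before calling `generation` would catch a dropped placeholder early, rather than letting the judge compare outputs that never saw the source text.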

In [13]:
followup_questions = generation(initial_generation_prompt, source_question_and_answer)

In [14]:
print(followup_questions)

<question>How much did having an NBA player as a father (Dell Curry) influence Steph's development as a shooter compared to players who didn't have that advantage?</question>

<question>If LeBron was clearly the more naturally gifted athlete, what specific skills or aspects of his game did he have to work hardest to develop throughout his career?</question>

<question>How common is it for NBA players to make it to the league primarily based on their shooting ability like Curry, rather than their athletic attributes?</question>

<question>What would you consider the minimum combination of natural talent and developed skills required to succeed in today's NBA compared to previous eras?</question>


# Win rate calculation

In [15]:
import concurrent.futures

def calculate_winrate(
    source_questions_and_answers,
    generation_prompt_one,
    generation_prompt_two
):
    one_win = 0 
    two_win = 0

    one_wins_judgements = []
    one_loses_judgements = []
    two_wins_judgements = []
    two_loses_judgements = []
    one_tie_judgements = []
    two_tie_judgements = []
    
    def calculate_winrate_single(source_question_and_answer):
        followup_questions_one = generation(
            generation_prompt=generation_prompt_one,
            source_question_and_answer=source_question_and_answer,
        )
        followup_questions_two = generation(
            generation_prompt=generation_prompt_two,
            source_question_and_answer=source_question_and_answer,
        )
        justification, judgement = judge_classification(
            source_question_and_answer=source_question_and_answer,
            followup_questions_one=followup_questions_one,
            followup_questions_two=followup_questions_two,
        )
        return justification, judgement
        
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        results = executor.map(calculate_winrate_single, source_questions_and_answers)
        results = list(results)    
    
    for justification, judgement in results:
        if judgement == "one":
            one_win += 1
            one_wins_judgements.append(justification)
            two_loses_judgements.append(justification)
        elif judgement == "two":
            two_win += 1
            two_wins_judgements.append(justification)
            one_loses_judgements.append(justification)
        else:
            one_win += 1/2
            two_win += 1/2
            one_tie_judgements.append(justification)
            two_tie_judgements.append(justification)
    return (
        (
            one_win / (one_win + two_win),
            one_wins_judgements,
            one_tie_judgements,
            one_loses_judgements,
        ),
        (
            two_win / (one_win + two_win),
            two_wins_judgements,
            two_tie_judgements,
            two_loses_judgements,
        ),
    )
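
Each judgement adds exactly one point in total (1 to the winner, or 1/2 to each side on a tie), so `one_win + two_win` equals the number of samples. The same arithmetic can be written directly from counts, as in this small sketch. The function name and the counts in the usage note are illustrative only.

```python
def winrate_from_counts(wins: int, losses: int, ties: int) -> float:
    """Win rate with each tie counted as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("need at least one judgement")
    return (wins + 0.5 * ties) / total
```

For example, 5 wins, 3 losses and 2 ties give (5 + 1) / 10 = 0.6 for one prompt and 0.4 for the other.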

In [16]:
evaluation_one, evaluation_two = calculate_winrate(
    source_questions_and_answers = source_questions_and_answers,
    generation_prompt_one = initial_generation_prompt,
    generation_prompt_two = initial_generation_prompt,
)

In [17]:
print(str(evaluation_one)[:500])

(0.65, ['Let me analyze both sets:\n\nSet 1:\n- Questions are more concise\n- Questions can stand alone without context\n- Questions avoid unnecessary words like "you mention" or "that people"\n- Questions maintain professional tone\n- All questions are distinct and relevant\n- Format follows the requested structure\n- Has 4 questions (within 3-5 range)\n\nSet 2:\n- Questions contain extra words that don\'t add value\n- Some questions include contextual phrases like "When you mention"\n- Questio
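
Here the initial prompt is compared against itself, so a win rate away from 0.5 may reflect sampling noise on a small dataset, or a judge that favors one position. The notebook does not say which. One common mitigation is to judge each pair in both orders and average, sketched below. The `Judge` signature is a simplified assumption (the notebook's `judge_classification` also takes the source text and returns the justification), and this doubles the number of judge calls.

```python
from typing import Callable

# Hypothetical simplified judge: takes two candidate outputs, returns "one", "two" or "tie".
Judge = Callable[[str, str], str]

SCORE_FOR_FIRST = {"one": 1.0, "two": 0.0, "tie": 0.5}


def judge_both_orders(judge: Judge, first: str, second: str) -> float:
    """Score for `first` in [0, 1], averaged over both presentation orders."""
    forward = SCORE_FOR_FIRST.get(judge(first, second), 0.5)
    # In the swapped call `first` is shown second, so the score is flipped.
    backward = 1.0 - SCORE_FOR_FIRST.get(judge(second, first), 0.5)
    return (forward + backward) / 2


def always_first(a: str, b: str) -> str:
    """Stub judge with pure position bias, for illustration."""
    return "one"


def prefers_shorter(a: str, b: str) -> str:
    """Stub judge with a real, order-independent preference."""
    if len(a) == len(b):
        return "tie"
    return "one" if len(a) < len(b) else "two"
```

With the `always_first` stub the averaged score is 0.5, while `prefers_shorter` still yields a clear winner regardless of order.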


# Prompt optimization

In [18]:
evaluation_string_template = """
Winrate: {winrate}
Cases where the prompt won: {win_judgements}
Cases where the prompt ties: {tie_judgements}
Cases where the prompt loses: {lose_judgements}
"""

In [19]:
optimization_prompt_template = """
Improve the generation prompt according to the feedback

{generation_prompt}

{evaluation_string}

Return the new prompt between <prompt> and </prompt>
""".strip()

In [20]:
def optimization(evaluation, generation_prompt) -> tuple[str, str]:
    winrate, win_judgements, tie_judgements, lose_judgements = evaluation
    evaluation_string = evaluation_string_template.format(
        winrate=winrate,
        win_judgements=win_judgements,
        tie_judgements=tie_judgements,
        lose_judgements=lose_judgements,
    )
    optimization_prompt = optimization_prompt_template.format(
        generation_prompt=generation_prompt,
        evaluation_string=evaluation_string,        
    )
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": optimization_prompt
            },
        ]
    )
    
    optimization_response = message.content[0].text
    optimized_prompt = extract_from_tags(optimization_response, tag_string="prompt")
    
    return optimization_response, optimized_prompt

In [21]:
import re
def extract_from_tags(text, tag_string="prompt"):
    pattern = f'<{tag_string}>(.*?)</{tag_string}>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return ''
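
`extract_from_tags` returns an empty string when no complete `<prompt>...</prompt>` pair is found, for instance if the response is cut off by `max_tokens` before the closing tag. Below is a minimal sketch of a fallback that keeps the previous prompt in that case. `next_prompt` is a hypothetical helper, and the extraction function is repeated only to keep the example self-contained.

```python
import re


def extract_from_tags(text: str, tag_string: str = "prompt") -> str:
    """Same logic as the notebook helper: first non-greedy match, else ''."""
    match = re.search(f"<{tag_string}>(.*?)</{tag_string}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def next_prompt(previous_prompt: str, optimization_response: str) -> str:
    """Use the extracted prompt if there is one, otherwise keep the previous prompt."""
    candidate = extract_from_tags(optimization_response)
    return candidate if candidate else previous_prompt
```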

In [22]:
optimization_response, optimized_prompt = optimization(evaluation_one, initial_generation_prompt)

In [23]:
print(optimization_response[:300])

Based on the analysis of winning and tie cases, I'll create an improved prompt that emphasizes conciseness, specificity, and standalone clarity:

<prompt>
You are given the following source question and a source answer.

<source_question_and_answer>
{source_question_and_answer}
</source_question_and


In [24]:
print(optimized_prompt)

You are given the following source question and a source answer.

<source_question_and_answer>
{source_question_and_answer}
</source_question_and_answer>

Write between 3 to 5 followup questions to the answer following these criteria:
- Make each question concise and direct
- Ensure each question can stand alone without context
- Focus on specific details rather than broad speculation
- Avoid phrases like "you mention" or "could you elaborate"
- Maintain distinct topics for each question
- Include both theoretical and practical aspects where relevant

Return each question between <question> and </question>


# Iterative optimization

Initial prompt `initial_generation_prompt`

-> 2x generation-one prompts `generation_prompt_one_{1,2}`

-> 4x generation-two prompts `generation_prompt_two_{1,2,3,4}`

We show the win rate table.
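
One way such a table could be assembled once the candidate prompts exist is sketched below, using plain dictionaries. `winrate_table` and `winrate_fn` are hypothetical. In this notebook `winrate_fn` could wrap `calculate_winrate` and return the first prompt's win rate, but that wiring is an assumption.

```python
from itertools import combinations
from typing import Callable


def winrate_table(
    prompts: dict[str, str],
    winrate_fn: Callable[[str, str], float],
) -> dict[str, dict[str, float]]:
    """table[a][b] is the win rate of prompt `a` against prompt `b`."""
    table: dict[str, dict[str, float]] = {name: {} for name in prompts}
    for name_a, name_b in combinations(prompts, 2):
        rate = winrate_fn(prompts[name_a], prompts[name_b])
        table[name_a][name_b] = rate
        # Ties count as half a win for each side, so the two rates sum to 1.
        table[name_b][name_a] = 1.0 - rate
    return table


def length_share(prompt_a: str, prompt_b: str) -> float:
    """Stub standing in for a real judged win rate, for illustration only."""
    return len(prompt_a) / (len(prompt_a) + len(prompt_b))
```

Each unordered pair is evaluated once, which keeps the number of generation and judge calls at one pass per pair.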

In [25]:
optimization_response, generation_prompt_one = optimization(evaluation_one, initial_generation_prompt)

In [26]:
optimization_response, generation_prompt_two = optimization(evaluation_one, initial_generation_prompt)

In [27]:
print(generation_prompt_one)

You are given the following source question and a source answer.

<source_question_and_answer>
{source_question_and_answer}
</source_question_and_answer>

Generate 3 to 5 followup questions to the answer. Each question should:
- Be concise and direct
- Focus on a single aspect or concept
- Stand alone without requiring context
- Avoid phrases like "you mention" or "could you elaborate"
- Build naturally from specific points in the source
- Ask for specific details rather than general opinions

Return each question between <question> and </question>


In [28]:
print(generation_prompt_two)

You are given the following source question and a source answer.

<source_question_and_answer>
{source_question_and_answer}
</source_question_and_answer>

Write between 3 to 5 followup questions to the answer. Follow these guidelines:
1. Write concise questions that avoid unnecessary words
2. Ensure each question can be understood without context
3. Make each question focus on a single, distinct aspect
4. Include both specific details and broader implications
5. Build questions that naturally flow from the source material

Return each question between <question> and </question>
