# Convert your judge into a generation prompt

Given
- a certified judge classifier `JUDGE_CLASSIFICATION_PROMPT: str` and optionally `process_response_for_judgement: func[str] -> str`
- some input string data samples `INPUT_CONTENTS: tuple[str]`
- initial prompt `INITIAL_GENERATION_PROMPT: str`

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Foreword

If you can't verify what is good or what is bad, you can't make good generations.

But if you can verify, you should easily be able to make good generations - and this tool helps you write this prompt.


#### Inspirations

- Cohere prompt tuner https://cohere.com/blog/intro-prompt-tuner
- Anthropic workbench https://console.anthropic.com/workbench/
- Chatbot Arena leaderboard https://lmarena.ai/

## Example use case presented in this notebook

Given a Quora answer, write followup questions to the answer.

Great followup questions should be appealing for the author to respond to, and the answers should be appealing to read.

The inputs for this use case
- `JUDGE_CLASSIFICATION_PROMPT: str` - prompt that classifies whether one set of output is better than the other
- `process_response_for_judgement: func[str] -> str` - how the output is post-processed for judgement
- `INPUT_CONTENTS: tuple[str]` - Quora answers
- `INITIAL_GENERATION_PROMPT: str` - prompt that generates followup questions

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Frequently asked questions

Why don't you include prompt for `JUDGE_CLASSIFICATION` in the generation prompt?
- You can if you want. But the result still needs to perform better according to the judge.
- I expect future models to brainstorm several responses and then think carefully about which response is best. However, doing so incurs extra cost and latency, and we might not want this tradeoff.
- The idea of prompt engineering for the generation prompt is to teach the model shortcuts on what good outputs are.

Why don't you tune the `JUDGE_CLASSIFICATION` as well?
- In this tool we assume that we trust `JUDGE_CLASSIFICATION`.
- It is important to get this right. You should tune this, and tune this elsewhere.
- In practice we do want to provide feedback to point out where the judge is obviously wrong. I leave this to the roadmap.

Why is the `JUDGE_CLASSIFICATION` doing comparison of two generation outputs instead of classifying whether one is good?
- For some answers it is easy to generate good followup questions but tricky to generate great followup questions.
- For other answers it may even be difficult to generate followup questions that are not outright bad.
- I want the judge to be useful in both cases.

Why is `process_response_for_judgement` also an input to this tool?
- In my prompt, I ask the output to include a rationale along with the text we show to the user.
- I don't want the rationale to bias the judgement of text we show to the user.
- You might need different post-processing methods for different use-cases.
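To illustrate the last answer, here is a minimal sketch of hiding the rationale from the judge. It uses the same `<question>` regex as the `process_response_for_judgement` cell later in this notebook; the helper name `strip_to_questions` and the sample response are made up for illustration.

```python
import re


def strip_to_questions(response: str) -> str:
    """Hypothetical helper: keep only the text inside <question>...</question> tags."""
    questions = re.findall(r"<question>(.*?)</question>", response, re.DOTALL)
    return "\n\n".join(question.strip() for question in questions)


# Made-up generation output that contains a rationale and a final question.
sample_response = """
<item>
<rationale>The answer mentions loss functions.</rationale>
<question>Which loss functions penalize deformed hands?</question>
</item>
"""

# The judge would only see the question text, not the rationale.
judge_view = strip_to_questions(sample_response)
print(judge_view)
```

A use case whose output has no rationale could pass the response through unchanged instead.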

# TODO

- Print more intermediate results to html
- Improve the optimization prompt
- Make the optimization prompt write all of system prompt, user prompt and assistant prefill

In [1]:
from functools import cache

import anthropic
client = anthropic.Anthropic()

# Inputs for the tool user

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("content.csv")

In [3]:
df.head(2)

Unnamed: 0,content
0,"Why do AI image generators have so much trouble rendering hands?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nThe current AI image generators are not rewarded for drawing hands well.\nThe training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).\nL(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).\nHowever, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.\nIf you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question."
1,"At what point do you consider an LLM to have an ability to plan?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nConsider the following query and response\nUser:\nHow many r are there in strawberry? Check your answer\nAssistant:\nThere are 2 'r' letters in the word ""strawberry"".\nTo check this answer, let's break down the word:\ns-t-r-a-w-b-e-r-r-y\nWe can see that there are indeed two instances of the letter 'r' in ""strawberry"":\n1. The first 'r' comes after 't'\n2. The second 'r' is the second-to-last letter\nThis confirms that the answer of 2 'r' letters is correct.\n\nhttps://poe.com/s/X6Rkh9wDJ50PSc...\nThe LLM has planned to BS their justification when they print the number “2”.\nBad planning is still planning."


In [4]:
INPUT_CONTENTS: tuple[str] = tuple(df["content"])

In [5]:
INITIAL_GENERATION_PROMPT = """
This is a Quora answer. Write 5 followup questions to this answer.

Requirements

- All phrases in the question should not require any context to be understood.
- The questions should not use more words than necessary.
- These questions are questions that the author would answer and people would read.


Reply in this format

<item>
<rationale>
rationale
</rationale>
<draft>
draft question
</draft>
<issues>
Identify all phrases that could not be understood without context
Pay special attention to the following words
- other, besides, similar
</issues>
<question>
fixed question
</question>
</item>
...


This is the question and answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

""".strip()

In [6]:
JUDGE_CLASSIFICATION_PROMPT = """
You are given an answer and two sets of followup questions.

Determine which set of followup questions is better.

Analyze the followup questions for issues.

These are the common issues

- The question has phrases that could not be understood without reading the source question and answer
    - Examples of questions with this issue:
        - What functions or teams still regularly use the remaining Castro Street office?
            - Reason: "remaining Castro Street office" does not make sense without context
        - What has been Mike's best performance in an external competitive programming contest?
            - Reason: Too vague about who Mike is
        - How long has your SoundLink Mini II been working properly since applying this fix?
            - Reason: This fix does not make sense without context
- The question uses more words than necessary
- The question asks for trivial information that is easily found online

Good questions are questions that

- The author is likely to answer the question
- The readers interested in the source question and answer will also be interested in answers to the followup questions


This is the source question and source answer

<source_question_and_answer>
{input_content}
</source_question_and_answer>

This is the first set of followup questions

{response_one}

This is the second set of followup questions

{response_two}

Reply in this format

<Set1>
<item>
<question>question</question>
<issues>issues</issues>
</item>
...
</Set1>

<Set2>
<item>
<question>question</question>
<issues>issues</issues>
</item>
...
</Set2>

Write some reasoning, then end your response with one of the following
- Set <label>one</label> is better.
- Set <label>two</label> is better.
- Both sets <label>tie</label>.

Consider a tie if one set of questions is not clearly better than the other.
""".strip()

In [7]:
import re
def process_response_for_judgement(response):
    pattern = r'<question>(.*?)</question>'
    return "\n\n".join(re.findall(pattern, response, re.DOTALL))

# Judge classification

In [8]:
# We assume that we trust this prompt

def judge_classification(
    input_content: str,
    response_one: str,
    response_two: str,
) -> tuple[str, str]:
    # Returns (full judge reply, label), where label is "one", "two" or "tie"

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": JUDGE_CLASSIFICATION_PROMPT.format(
                    input_content=input_content,
                    response_one=process_response_for_judgement(response_one),
                    response_two=process_response_for_judgement(response_two),
                )
            },
        ],
    )
    message_text = message.content[0].text
    if "<label>one</label>" in message_text:
        return message_text, "one"
    if "<label>two</label>" in message_text:
        return message_text, "two"
    return message_text, "tie"
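The substring checks above return "one" whenever `<label>one</label>` appears anywhere in the reply, even if the judge's final verdict is different. A hedged alternative, not the author's method, is to read the last label in the reply, since the judge prompt asks for the verdict at the end. The helper name `parse_last_label` is hypothetical.

```python
import re


def parse_last_label(message_text: str) -> str:
    """Hypothetical helper: return the final <label> in the judge reply, or "tie" if none."""
    labels = re.findall(r"<label>(one|two|tie)</label>", message_text)
    return labels[-1] if labels else "tie"


# Made-up judge reply that mentions one label in its reasoning before the verdict.
reply = "Set <label>one</label> has shorter questions, but Set <label>two</label> is better."
print(parse_last_label(reply))
```

Falling back to "tie" when no label is found matches the behaviour of `judge_classification`.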

In [9]:
input_content = """
Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.
""".strip()

response_one = """
<question>What are some AI image generation models?</question>

<question>Where can I find the training set for the image models?</question>

<question>What are some ways to improve the quality of AI-generated images?</question>

<question>What is the role of captions in AI image generation?</question>

<question>How is L(caption, image) minimized?</question>
"""

response_two = """
<question>How do we reward AI image generators to draw hands correctly?</question>

<question>Why are hands particularly challenging compared to other anatomical features?</question>

<question>What preprocessing techniques have worked best for curating training datasets that lead to better hand renderings?</question>

<question>What specific loss functions have you found most effective for detecting and penalizing anatomical deformities in AI-generated images?</question>

<question>How do different AI image generation models (like Stable Diffusion, DALL-E, Midjourney) compare in their ability to render hands?</question>
"""

In [10]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_one,
    response_two=response_two,
)

In [11]:
judgement

'two'

In [12]:
print(justification)

<Set1>
<item>
<question>What are some AI image generation models?</question>
<issues>Too trivial - easily found online</issues>
</item>
<item>
<question>Where can I find the training set for the image models?</question>
<issues>Too trivial - easily found online</issues>
</item>
<item>
<question>What are some ways to improve the quality of AI-generated images?</question>
<issues>Too broad and general</issues>
</item>
<item>
<question>What is the role of captions in AI image generation?</question>
<issues>Already explained in the answer</issues>
</item>
<item>
<question>How is L(caption, image) minimized?</question>
<issues>Too technical without proper context</issues>
</item>
</Set1>

<Set2>
<item>
<question>How do we reward AI image generators to draw hands correctly?</question>
<issues>None - follows directly from answer</issues>
</item>
<item>
<question>Why are hands particularly challenging compared to other anatomical features?</question>
<issues>None - relevant extension of origin

In [13]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_two,  # swapped
    response_two=response_one,  # swapped
)

In [14]:
judgement

'one'

In [15]:
print(justification)

<Set1>
<item>
<question>How do we reward AI image generators to draw hands correctly?</question>
<issues>None - directly follows from the answer's discussion of rewards</issues>
</item>
<item>
<question>Why are hands particularly challenging compared to other anatomical features?</question>
<issues>None - natural extension of the original question</issues>
</item>
<item>
<question>What preprocessing techniques have worked best for curating training datasets that lead to better hand renderings?</question>
<issues>None - technical but relevant follow-up</issues>
</item>
<item>
<question>What specific loss functions have you found most effective for detecting and penalizing anatomical deformities in AI-generated images?</question>
<issues>Assumes personal experience which may not exist</issues>
</item>
<item>
<question>How do different AI image generation models compare in their ability to render hands?</question>
<issues>Could be considered trivial information findable online</issues>
</

# Generation prompt

In [16]:
def generation(
    generation_prompt: str,
    input_content: str,
) -> str:
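    try:
        generation_prompt_with_inputs = generation_prompt.format(
            input_content=input_content,
        )
    except (KeyError, IndexError, ValueError):
        # str.format raises when the prompt has other placeholders or unbalanced braces
        return "{input_content} should appear in the prompt"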

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": generation_prompt_with_inputs
            },
        ],
        stop_sequences=["</questions>"],
    )
    message_text = message.content[0].text
    return message_text

In [17]:
response = generation(INITIAL_GENERATION_PROMPT, input_content)

In [18]:
print(response)

Let me help create 5 follow-up questions:

<item>
<rationale>
The answer mentions training data and captions, so asking about specific datasets would be relevant.
</rationale>
<draft>
What datasets are commonly used to train AI image generators?
</draft>
<issues>
No context issues found
</issues>
<question>
What datasets are commonly used to train AI image generators?
</question>
</item>

<item>
<rationale>
The answer discusses loss functions, so exploring alternative loss functions is relevant.
</rationale>
<draft>
What other loss functions could improve hand generation in AI models?
</draft>
<issues>
"other" requires context of existing loss functions
</issues>
<question>
Which loss functions could improve hand generation in AI models?
</question>
</item>

<item>
<rationale>
The answer mentions deformities but doesn't explain why hands specifically are problematic.
</rationale>
<draft>
Why are hands more difficult to generate than other body parts?
</draft>
<issues>
No context issues

# Win rate calculation

In [19]:
import concurrent.futures

@cache
def calculate_winrate(
    input_contents: tuple[str],
    generation_prompt_one: str,
    generation_prompt_two: str,
    filename_suffix: str = "",
):
    one_win = 0
    two_win = 0

    one_wins_judgements = []
    one_loses_judgements = []
    two_wins_judgements = []
    two_loses_judgements = []
    one_tie_judgements = []
    two_tie_judgements = []
    
    def calculate_winrate_single(input_content, index):
        response_one = generation(
            generation_prompt=generation_prompt_one,
            input_content=input_content,
        )
        response_two = generation(
            generation_prompt=generation_prompt_two,
            input_content=input_content,
        )
        flipped = (index%2 == 1)
        if not flipped:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_one,
                response_two=response_two,
            )
        else:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_two,
                response_two=response_one,
            )
        return response_one, response_two, justification, judgement, flipped
        
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        results = executor.map(calculate_winrate_single, input_contents, list(range(len(input_contents))))
        results = list(results)    
    
    judgement_unflipped = []
    for _, _, justification, judgement, flipped in results:
        if not flipped:
            if judgement == "one":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "two":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
        else:
            if judgement == "two":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "one":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
    
    if filename_suffix:
        df_to_display = pd.DataFrame(
            {
                "input_content": [""] + list(input_contents),
                "response_one": [generation_prompt_one] + [response_one for response_one, _, _, _, _ in results],
                "response_two": [generation_prompt_two] + [response_two for _, response_two, _, _, _ in results],
                "judgement": [""] + judgement_unflipped,
                "justification": [""] + [justification for _, _, justification, _, _ in results],
            }
        )
        display_dataframe(df_to_display, filename_suffix=filename_suffix)
        
    return (
        (
            one_win / (one_win + two_win),
            one_wins_judgements,
            one_tie_judgements,
            one_loses_judgements,
        ),
        (
            two_win / (one_win + two_win),
            two_wins_judgements,
            two_tie_judgements,
            two_loses_judgements,
        ),
    )
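The flipped branch and the half-point ties above can be restated compactly. This is a minimal sketch of the same bookkeeping, with hypothetical helper names `unflip` and `winrates` that are not part of the notebook.

```python
def unflip(judgement: str, flipped: bool) -> str:
    """Map the judge's label back to the original prompt order."""
    if not flipped or judgement == "tie":
        return judgement
    return "two" if judgement == "one" else "one"


def winrates(judgements: list[str]) -> tuple[float, float]:
    """Win rates for prompt one and prompt two; a tie gives half a win to each."""
    one_win = sum(1.0 if j == "one" else 0.5 if j == "tie" else 0.0 for j in judgements)
    two_win = len(judgements) - one_win
    return one_win / len(judgements), two_win / len(judgements)


# Made-up (judgement, flipped) pairs: the judge saw the swapped order on odd indices.
raw = [("one", False), ("two", True), ("tie", False), ("two", False)]
unflipped = [unflip(judgement, flipped) for judgement, flipped in raw]
print(unflipped)            # ['one', 'one', 'tie', 'two']
print(winrates(unflipped))  # (0.625, 0.375)
```

Alternating the order by index means each prompt is shown first in about half of the comparisons, which limits the effect of any position bias in the judge.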

In [20]:
import os
import html
from IPython.display import display, HTML

def display_dataframe(df: pd.DataFrame, filename_suffix=""):
    html_prefix = '''
    <meta charset="UTF-8">
    <style>
    table {
        border-collapse: collapse;
    }
    td, th {
        border: 1px solid black;
        padding: 5px;
        vertical-align: top;
    }
    td {
        white-space: pre-wrap;
        font-family: monospace;
    }
    </style>
    '''
    
    # Define a single style function that highlights response_one or response_two based on judgement
    def highlight_responses(row):
        styles = ['' for _ in row]
        if row['judgement'] == 'one':
            styles[row.index.get_loc('response_one')] = 'background-color: #90EE90'
        elif row['judgement'] == 'two':
            styles[row.index.get_loc('response_two')] = 'background-color: #90EE90'
        return styles

    os.makedirs("html_output", exist_ok=True)
    output_table_file_name = f"html_output/winrate_calculation{filename_suffix}.html"
    
    # Replace newline characters and escape HTML
    styled_df = df.replace({r'\n': '__NEWLINE__'}, regex=True).applymap(str).applymap(html.escape).replace({'__NEWLINE__': '<br>'}, regex=True)
    
    # Apply the style function
    styled_df = styled_df.style.apply(highlight_responses, axis=1)
    
    # Write the styled DataFrame to an HTML file
    # (Styler.to_html replaces Styler.render, which was removed in pandas 2.0)
    with open(output_table_file_name, 'w', encoding='utf-8') as f:
        f.write(html_prefix + styled_df.to_html())
    
    # Create a clickable link to the HTML file
    link = f'<a href="{output_table_file_name}" target="_blank">{output_table_file_name}</a>'
    display(HTML(link))

In [21]:
evaluation_one, evaluation_two = calculate_winrate(
    input_contents = INPUT_CONTENTS[:20],
    generation_prompt_one = INITIAL_GENERATION_PROMPT,
    generation_prompt_two = "Generate a question and return in <question> and </question>",
    filename_suffix = "-demo",
)

In [22]:
print(str(evaluation_one)[:500])

(1.0, ["<Set1>\n<item>\n<question>How do loss functions determine image quality in AI generators?</question>\n<issues>None - relevant technical question that builds on the answer</issues>\n</item>\n<item>\n<question>What metrics do AI models use to match images with captions?</question>\n<issues>Somewhat redundant with first question, but still relevant</issues>\n</item>\n<item>\n<question>How can AI models be trained to avoid anatomical deformities?</question>\n<issues>None - directly addresses


# Prompt optimization

In [23]:
evaluation_string_template = """
Winrate: {winrate}
Cases where the prompt won: {win_judgements}
Cases where the prompt ties: {tie_judgements}
Cases where the prompt loses: {lose_judgements}
"""

In [24]:
optimization_prompt_template = """
Improve the generation prompt according to the feedback

<current_generation_prompt>
{generation_prompt}
</current_generation_prompt>

<feedback>
{evaluation_string}
</feedback>

This is the judging criteria
<judging_criteria>
{judging_criteria}
</judging_criteria>

Summarize the changes that you intend to make, and return the new prompt between <prompt> and </prompt>.
""".strip()

In [25]:
def optimization(
    generation_prompt: str,
    evaluations: list,
) -> tuple[str, str]:
    evaluation_string = ""
    for evaluation in evaluations:
        winrate, win_judgements, tie_judgements, lose_judgements = evaluation
        evaluation_string_single = evaluation_string_template.format(
            winrate=winrate,
            win_judgements=win_judgements,
            tie_judgements=tie_judgements,
            lose_judgements=lose_judgements,
        )
        evaluation_string += evaluation_string_single

    optimization_prompt = optimization_prompt_template.format(
        generation_prompt=generation_prompt,
        evaluation_string=evaluation_string,
        judging_criteria=JUDGE_CLASSIFICATION_PROMPT,
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": optimization_prompt
            },
        ]
    )
    
    optimization_response = message.content[0].text
    optimized_prompt = extract_from_tags(optimization_response, tag_string="prompt")
    
    return optimization_response, optimized_prompt

In [26]:
import re
def extract_from_tags(text, tag_string="prompt"):
    pattern = f'<{tag_string}>(.*?)</{tag_string}>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return ''
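`extract_from_tags` returns an empty string when the optimizer omits the `<prompt>` tags, and `str.format` does not complain when `{input_content}` is missing from a prompt. A minimal sketch of a check that could run before accepting a new prompt follows. The helper name `is_usable_prompt` is hypothetical and not part of the notebook.

```python
def is_usable_prompt(prompt: str) -> bool:
    """Hypothetical check: the prompt contains {input_content} and formats cleanly."""
    if "{input_content}" not in prompt:
        return False
    try:
        prompt.format(input_content="probe")
    except (KeyError, IndexError, ValueError):
        # Other placeholders or unbalanced braces make str.format fail.
        return False
    return True


print(is_usable_prompt("Write followup questions for: {input_content}"))  # True
print(is_usable_prompt(""))                                               # False
print(is_usable_prompt("{input_content} and {something_else}"))           # False
```

One option is to keep the previous prompt whenever the check fails, so a malformed optimizer reply does not enter the list of prompts.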

In [27]:
optimization_response, optimized_prompt = optimization(INITIAL_GENERATION_PROMPT, [evaluation_one])

In [28]:
print(optimization_response[:1000])

Based on the feedback, the prompt consistently wins when it produces questions that:
1. Build directly on concepts mentioned in the answer
2. Are relevant to the original topic
3. Would be useful to readers interested in the original question
4. Avoid using contextual phrases
5. Don't ask for easily findable information
6. Are concise

The main changes I'll make:
1. Emphasize that questions should build on specific concepts/points mentioned in the answer
2. Add explicit requirements about avoiding easily findable information
3. Stress the importance of using standalone phrases that don't require context
4. Add examples of good and bad questions
5. Clarify that questions should interest both the original author and readers

Here's the improved prompt:

<prompt>
This is a Quora answer. Write 5 followup questions to this answer.

Requirements:

1. Questions must build directly on specific concepts mentioned in the answer
2. All phrases must be understandable without reading the source que

In [29]:
print(optimized_prompt)

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:

1. Questions must build directly on specific concepts mentioned in the answer
2. All phrases must be understandable without reading the source question/answer
3. Questions should not use more words than necessary
4. Questions should not ask for information that is easily found online
5. Questions should interest both the author and readers of the original post

Examples of bad questions:
- "What does this solution achieve?" (requires context)
- "What are the system requirements?" (easily found online)
- "Besides these approaches, what other methods exist?" (vague "these")

Examples of good questions:
- "How do loss functions determine image quality in AI generators?"
- "What behaviors demonstrate an LLM's ability to plan?"
- "What solutions are being developed to improve AI hand rendering?"

Reply in this format:

<item>
<rationale>
Explain how this builds on a specific point from the answer
</rationale>

# Iterative optimization

In [30]:
generation_prompts = [INITIAL_GENERATION_PROMPT, INITIAL_GENERATION_PROMPT]

In [31]:
import random
for latest_index in range(1, 3):
    generation_prompt_latest = generation_prompts[-1]
    evaluations = []
    for previous_index, generation_prompt_old in enumerate(generation_prompts[:-1]):
        random.seed(f"{previous_index} {latest_index}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        _, evaluation = calculate_winrate(
            input_contents,
            generation_prompt_old,
            generation_prompt_latest,
            filename_suffix = f"-{previous_index}-{latest_index}",
        )
        evaluations.append(evaluation)

    optimization_response, generation_prompt_new = optimization(generation_prompt_latest, evaluations)
    generation_prompts.append(generation_prompt_new)
    print(generation_prompt_new)
    print("---")

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:

- All phrases in the question should be understandable without any additional context
- Questions should use minimal necessary words
- Questions should build naturally from concepts mentioned in the answer
- Questions should focus on specific aspects rather than broad topics
- Questions should not make assumptions beyond what's stated in the answer
- Questions should emphasize practical implementation and real experiences
- Questions should balance technical details with user interests
- Questions should maintain topic consistency with the original answer
- Questions should seek unique insights rather than easily searchable information
- Questions should be ones that both the author would likely answer and readers would want to read

Reply in this format:

<item>
<rationale>
Why this question follows from the answer and would interest readers
</rationale>
<draft>
draft question
</draft>
<issues>
Identify

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:

- Questions should focus on personal experience and unique insights rather than easily searchable information
- Use minimal necessary words while maintaining clarity
- Build naturally from specific details and unique aspects mentioned in the answer
- Focus on practical implementation details and real experiences
- Questions should interest readers who found the original answer valuable
- Ask about specific aspects rather than broad topics
- Don't make assumptions beyond what's stated in the answer
- Focus on actionable details over general concepts
- Target insights that aren't easily found through online searches
- Questions should naturally extend from the unique perspective or experience shared in the answer

Reply in this format:

<item>
<rationale>
Why this question:
- Builds on unique aspects of the answer
- Would interest original readers
- Asks for insights not easily found elsewhere
</rationale>

# Display winrate table

In [32]:
import numpy as np
win_rate_matrix = [[np.nan for _ in generation_prompts] for _ in generation_prompts]

for one_idx, prompt_one in enumerate(generation_prompts):
    for two_idx, prompt_two in enumerate(generation_prompts[one_idx+1:], start=one_idx+1):
        random.seed(f"{one_idx} {two_idx}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        evaluation_one, evaluation_two = calculate_winrate(
            input_contents,
            prompt_one,
            prompt_two,
            filename_suffix = f"-{one_idx}-{two_idx}",
        )
        win_rate_one, _, _, _ = evaluation_one
        win_rate_two, _, _, _ = evaluation_two
        win_rate_matrix[two_idx][one_idx] = win_rate_one
        win_rate_matrix[one_idx][two_idx] = win_rate_two

In [33]:
win_rate_matrix

[[nan, 0.675, 0.8, 0.575],
 [0.325, nan, 0.85, 0.6],
 [0.2, 0.15, nan, 0.5],
 [0.425, 0.4, 0.5, nan]]

In [34]:
import numpy as np
from IPython.display import HTML

num_prompts = len(generation_prompts)

# Start building the HTML table
html_table = "<table border='1' style='border-collapse: collapse;'>"

# Create the header row
html_table += "<tr><th></th>"
for j in range(num_prompts):
    html_table += f"<th>Prompt {j}</th>"
html_table += "</tr>"

for i in range(num_prompts):
    html_table += f"<tr><th>Prompt {i}</th>"
    for j in range(num_prompts):
        if i == j or np.isnan(win_rate_matrix[i][j]):
            html_table += "<td></td>"  # Empty cell for diagonal or undefined win rates
        else:
            win_rate = win_rate_matrix[i][j]
            cell_html = f'<a href="html_output/winrate_calculation-{min(i,j)}-{max(i,j)}.html" target="_blank" style="text-decoration: none;">{win_rate:.2f}</a>'
            html_table += f"<td>{cell_html}</td>"
    html_table += "</tr>"
html_table += "</table>"

display(HTML(html_table))

| | Prompt 0 | Prompt 1 | Prompt 2 | Prompt 3 |
|---|---|---|---|---|
| Prompt 0 | | 0.68 | 0.8 | 0.57 |
| Prompt 1 | 0.33 | | 0.85 | 0.6 |
| Prompt 2 | 0.2 | 0.15 | | 0.5 |
| Prompt 3 | 0.42 | 0.4 | 0.5 | |


# Conclusion
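The prompts broadly improve over the iterations, according to the judge we assume we trust. In the table above, the cell in row i, column j is the win rate of prompt j against prompt i, and every later prompt scores at least 0.5 against every earlier one. Prompts 0 and 1 are the same text (both are `INITIAL_GENERATION_PROMPT`), so the 0.68 between them suggests the level of noise to expect from 20 samples.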
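As a small worked check of how to read the matrix, the sketch below copies the values printed by `win_rate_matrix` earlier and pulls out the win rate of each prompt against its immediate predecessor. The variable names `consecutive` and `later_vs_earlier` are only for illustration.

```python
import math

# Values copied from the win_rate_matrix output above.
# win_rate_matrix[i][j] is the win rate of prompt j against prompt i.
win_rate_matrix = [
    [math.nan, 0.675, 0.8, 0.575],
    [0.325, math.nan, 0.85, 0.6],
    [0.2, 0.15, math.nan, 0.5],
    [0.425, 0.4, 0.5, math.nan],
]
num_prompts = len(win_rate_matrix)

# Win rate of prompt k+1 against prompt k.
consecutive = [win_rate_matrix[k][k + 1] for k in range(num_prompts - 1)]
print(consecutive)  # [0.675, 0.85, 0.5]

# Win rate of every later prompt against every earlier prompt.
later_vs_earlier = [
    win_rate_matrix[i][j]
    for i in range(num_prompts)
    for j in range(i + 1, num_prompts)
]
print(all(rate >= 0.5 for rate in later_vs_earlier))  # True
```

The last consecutive value is 0.5, so prompt 3 ties with prompt 2 rather than beating it.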

In [35]:
print(input_content)

Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.


In [36]:
response = generation(generation_prompts[0], input_content)

In [37]:
print(response)

I'll create 5 follow-up questions based on this answer.

<item>
<rationale>
This explores the technical aspect of how loss functions work in AI image generation.
</rationale>
<draft>
How does the loss function determine if an image matches its caption?
</draft>
<issues>
No context issues found
</issues>
<question>
How does the loss function determine if an image matches its caption?
</question>
</item>

<item>
<rationale>
Addresses the fundamental problem mentioned about deformity detection.
</rationale>
<draft>
How can AI detect deformities in generated images?
</draft>
<issues>
No context issues found
</issues>
<question>
How can AI detect deformities in generated images?
</question>
</item>

<item>
<rationale>
Explores potential solutions to the hand-rendering problem.
</rationale>
<draft>
What training methods could improve AI hand rendering?
</draft>
<issues>
No context issues found
</issues>
<question>
What training methods could improve AI hand rendering?
</question>
</item>

<i

In [38]:
response = generation(generation_prompts[-1], input_content)

In [39]:
print(response)

Here are 5 follow-up questions based on that answer:

<item>
<rationale>
- Explores specific technical implementation details
- Builds on the loss function concept mentioned
- Seeks practical insight from someone with direct experience
</rationale>
<draft>
What specific loss function modifications have you tried or seen that improve hand rendering quality?
</draft>
<issues>
- Assumes reader has tried modifications
- Could be more focused on observed results
</issues>
<question>
Which loss function adjustments have proven most effective for improving hand rendering in your experience?
</question>
</item>

<item>
<rationale>
- Addresses core problem of deformity detection
- Builds on technical explanation
- Seeks specific implementation details
</rationale>
<draft>
How could an AI system be trained to specifically recognize and penalize hand deformities?
</draft>
<issues>
- Too theoretical
- Needs more focus on practical experience
</issues>
<question>
What methods have you seen used to 