# Convert your judge into a generation prompt

Given
- a certified judge classifier `JUDGE_CLASSIFICATION_PROMPT: str`
- some input string data samples `INPUT_CONTENTS: tuple[str, ...]`
- initial prompt `INITIAL_GENERATION_PROMPT: str`

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Foreword

If you can't verify what is good and what is bad, you can't reliably make good generations.

But if you can verify, making good generations should be much easier, and this tool helps you write the prompt that does it.


#### Inspirations

- Cohere prompt tuner https://cohere.com/blog/intro-prompt-tuner
- Anthropic workbench https://console.anthropic.com/workbench/
- Chatbot Arena leaderboard https://lmarena.ai/

## Example use case presented in this notebook

Given a Quora answer, write followup questions to the answer.

Great followup questions should be appealing to respond to, and their answers should be appealing to read.

The inputs for this use case
- `JUDGE_CLASSIFICATION_PROMPT: str` - prompt that decides which of two sets of outputs is better
- `process_response_for_judgement: Callable[[str], str]` - how an output is post-processed before judgement
- `INPUT_CONTENTS: tuple[str, ...]` - Quora answers
- `INITIAL_GENERATION_PROMPT: str` - prompt that generates followup questions

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Frequently asked questions

Why don't you include `JUDGE_CLASSIFICATION_PROMPT` in the generation prompt?
- You can if you want. But the result still needs to perform better according to the judge.
- I expect future models to brainstorm several responses and then think carefully about which one is best. However, doing so incurs extra cost and latency, and we might not want this tradeoff.
- The idea of prompt engineering for the generation prompt is to teach the model shortcuts to what good outputs look like.

Why don't you tune `JUDGE_CLASSIFICATION_PROMPT` as well?
- In this tool we assume that we trust `JUDGE_CLASSIFICATION_PROMPT`.
- It is important to get the judge right. You should tune it, but tune it elsewhere.
- In practice we do want to provide feedback that points out where the judge is obviously wrong. I leave this to the roadmap.

Why does `JUDGE_CLASSIFICATION_PROMPT` compare two generation outputs instead of classifying whether a single output is good?
- For some answers it is easy to generate good followup questions but tricky to generate great ones.
- For other answers it may be difficult even to generate followup questions that are not outright bad.
- I want the judge to be useful in both cases.

Why is `process_response_for_judgement` also an input to this tool?
- In my prompt, I ask the output to include a rationale along with the text we show to the user.
- I don't want the rationale to bias the judgement of the text we show to the user.
- You might need different post-processing methods for different use cases.
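
The post-processing step described above can be illustrated with a small sketch. The helper mirrors the notebook's `process_response_for_judgement`; the `raw_output` string is a made-up example, not real model output.

```python
import re

def process_response_for_judgement(response: str) -> str:
    # Same logic as the notebook's helper: keep only the <question> bodies,
    # so the judge never sees the <rationale> text.
    pattern = r"<question>(.*?)</question>"
    return "\n\n".join(re.findall(pattern, response, re.DOTALL))

# Made-up generation output in the format the generation prompt asks for.
raw_output = """
<item>
<rationale>
The answer mentions loss functions.
</rationale>
<question>
What loss functions penalize deformities?
</question>
</item>
"""

judged_text = process_response_for_judgement(raw_output)
print(judged_text.strip())
```

Only the question text reaches the judge; a use case with a different output format would swap in its own post-processing function.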

# TODO

- Print more intermediate results to HTML
- Improve the optimization prompt
- Make the optimization prompt write the system prompt, user prompt, and assistant prefill

In [1]:
from functools import cache

import anthropic
client = anthropic.Anthropic()

# Inputs for the tool user

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("content.csv")

In [3]:
df.head(2)

Unnamed: 0,content
0,"Why do AI image generators have so much trouble rendering hands?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nThe current AI image generators are not rewarded for drawing hands well.\nThe training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).\nL(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).\nHowever, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.\nIf you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question."
1,"At what point do you consider an LLM to have an ability to plan?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nConsider the following query and response\nUser:\nHow many r are there in strawberry? Check your answer\nAssistant:\nThere are 2 'r' letters in the word ""strawberry"".\nTo check this answer, let's break down the word:\ns-t-r-a-w-b-e-r-r-y\nWe can see that there are indeed two instances of the letter 'r' in ""strawberry"":\n1. The first 'r' comes after 't'\n2. The second 'r' is the second-to-last letter\nThis confirms that the answer of 2 'r' letters is correct.\n\nhttps://poe.com/s/X6Rkh9wDJ50PSc...\nThe LLM has planned to BS their justification when they print the number “2”.\nBad planning is still planning."


In [4]:
INPUT_CONTENTS: tuple[str, ...] = tuple(df["content"])

In [5]:
INITIAL_GENERATION_PROMPT = """
This is a Quora answer. Write 5 followup questions to this answer.

Requirements

- Questions asked should not require context to be understood.
- The questions should not use more words than necessary.
- These questions are questions that the author would answer and people would read.


Reply in this format

<item>
<rationale>
rationale
</rationale>
<question>
question
</question>
</item>
...


This is the question and answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

""".strip()

In [6]:
JUDGE_CLASSIFICATION_PROMPT = """
You are given an answer and two sets of followup questions.

Determine which set of followup questions is better.

These are required
- Each question should appear between <question> and </question>
- There should be 5 questions

The followup questions are considered bad if they have any of these characteristics
- There are ambiguous references in the question that cannot be understood without context
- The question uses more words than necessary

If both sets of questions have neither of the bad characteristics, the better set of followup questions should have these characteristics
- Each question should be distinct
- The questions ask for knowledge that is not easily found online
- The author is likely to answer the questions
- People interested in the source question and answer will also be interested in answers to the followup questions

This is the source question and source answer
<source_question_and_answer>
{input_content}
</source_question_and_answer>

This is the first set of followup questions

{response_one}

This is the second set of followup questions

{response_two}

Write some reasoning, then end your response with one of the following
- Set <label>one</label> is better.
- Set <label>two</label> is better.
- Both sets <label>tie</label>.
""".strip()

In [7]:
import re
def process_response_for_judgement(response):
    pattern = r'<question>(.*?)</question>'
    return "\n\n".join(re.findall(pattern, response, re.DOTALL))

# Judge classification

In [8]:
# We assume that we trust this prompt

def judge_classification(
    input_content: str,
    response_one: str,
    response_two: str,
) -> tuple[str, str]:
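    # Returns (the judge's full reply, label), where label is "one", "two" or "tie".
    # A reply without a recognizable label is treated as a tie.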
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": JUDGE_CLASSIFICATION_PROMPT.format(
                    input_content=input_content,
                    response_one=process_response_for_judgement(response_one),
                    response_two=process_response_for_judgement(response_two),
                )
            },
        ],
    )
    message_text = message.content[0].text
    if "<label>one</label>" in message_text:
        return message_text, "one"
    if "<label>two</label>" in message_text:
        return message_text, "two"
    return message_text, "tie"
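
The function above checks for `<label>one</label>` first, so a reply whose reasoning mentions one label in passing and then ends with another would be read as "one". A minimal sketch of a stricter parser that takes the last label instead; `parse_last_label` is a hypothetical helper, not part of the notebook, and the sample reply is invented.

```python
import re

LABEL_PATTERN = re.compile(r"<label>(one|two|tie)</label>")

def parse_last_label(message_text: str) -> str:
    """Return the last label in the judge's reply, or "tie" if none is found."""
    labels = LABEL_PATTERN.findall(message_text)
    return labels[-1] if labels else "tie"

# Invented reply that mentions one label in the reasoning and ends with another.
reply = "Set <label>one</label> is concise, but overall Set <label>two</label> is better."
print(parse_last_label(reply))
```

Falling back to "tie" keeps the same default as the notebook when the reply carries no label at all.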

In [9]:
input_content = """
Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.
""".strip()

response_one = """
<question>What are some AI image generation models?</question>

<question>Where can I find the training set for the image models?</question>

<question>What are some ways to improve the quality of AI-generated images?</question>

<question>What is the role of captions in AI image generation?</question>

<question>What techniques have researchers tried so far to address the hand rendering problem?</question>
"""

response_two = """
<question>How do we reward AI image generators to draw hands correctly?</question>

<question>Why are hands particularly challenging compared to other anatomical features?</question>

<question>What preprocessing techniques have worked best for curating training datasets that lead to better hand renderings?</question>

<question>What specific loss functions have you found most effective for detecting and penalizing anatomical deformities in AI-generated images?</question>

<question>How do different AI image generation models (like Stable Diffusion, DALL-E, Midjourney) compare in their ability to render hands?</question>
"""

In [10]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_one,
    response_two=response_two,
)

In [11]:
judgement

'two'

In [12]:
print(justification)

Let me analyze both sets:

Set 1:
- Questions are clear and simple
- However, they are quite general and basic
- Most answers can be easily found online
- Some questions aren't specifically related to the hand-rendering problem
- Questions don't build much on the technical aspects mentioned in the answer

Set 2:
- Questions are more technically focused
- Directly relates to the problem of hand rendering
- Builds upon concepts mentioned in the answer (loss functions, rewards)
- Asks for specific experiences and comparisons
- Questions would be interesting to people who care about the technical aspects of AI image generation
- Questions are challenging but still answerable by someone knowledgeable in the field

The second set is better because:
1. Questions are more specific and targeted
2. They build upon the technical concepts mentioned in the original answer
3. They ask for knowledge that would be valuable but isn't easily found through a simple web search
4. The questions maintain fo

In [13]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_two,  # swapped
    response_two=response_one,  # swapped
)

In [14]:
judgement

'one'

In [15]:
print(justification)

Let me analyze both sets:

Set One:
- Questions are specific and focused on the hand rendering problem
- Questions build upon the original answer's discussion of rewards and loss functions
- Questions show technical depth while remaining answerable
- Questions are distinct and progressive
- Questions target knowledge that would require expert experience, not just Googling
- All questions are relevant to people interested in AI image generation and the hand problem

Set Two:
- Questions are more general and broad about AI image generation
- Only the last question specifically relates to the hand rendering problem
- Questions could largely be answered through basic online research
- Questions don't deeply engage with the technical concepts introduced in the answer
- Questions are less focused on the specific problem discussed in the original answer
- Some questions (like "What are some AI image generation models?") are very basic

The first set is superior because:
1. It maintains focus 
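
The two runs above swap the order of the sets, and the judge prefers the same underlying set both times. That check can be written down as a small helper. This is a sketch only; `is_order_consistent` is a hypothetical name, not something the notebook defines.

```python
def is_order_consistent(judgement_original: str, judgement_swapped: str) -> bool:
    """True when the judge prefers the same underlying set in both orders."""
    # In the swapped call, "one" refers to what was originally set two.
    swap = {"one": "two", "two": "one", "tie": "tie"}
    return swap[judgement_swapped] == judgement_original

# The pair of judgements shown above: 'two' in the original order, 'one' when swapped.
print(is_order_consistent("two", "one"))
```

An inconsistent pair would suggest the judge is reacting to position rather than content, which is why the win rate calculation later alternates the order.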

# Generation prompt

In [16]:
def generation(
    generation_prompt: str,
    input_content: str,
) -> str:
    try:
        generation_prompt_with_inputs = generation_prompt.format(
            input_content=input_content,
        )
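    except (KeyError, IndexError, ValueError):
        # str.format raises these when the prompt contains unknown or malformed {placeholders}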
        return "{input_content} should appear in the prompt"

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": generation_prompt_with_inputs
            },
        ],
        stop_sequences=["</questions>"],
    )
    message_text = message.content[0].text
    return message_text
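
One caveat with `str.format` is that any other braces in the prompt (for example a JSON snippet) are parsed as placeholders, while a prompt with no `{input_content}` at all formats without error. A minimal sketch of an alternative that substitutes only the literal placeholder; `fill_prompt` is a hypothetical helper and the sample prompt is invented.

```python
def fill_prompt(generation_prompt: str, input_content: str) -> str:
    """Substitute only the literal {input_content} placeholder."""
    placeholder = "{input_content}"
    if placeholder not in generation_prompt:
        raise ValueError("{input_content} should appear in the prompt")
    # str.replace leaves every other brace in the prompt untouched.
    return generation_prompt.replace(placeholder, input_content)

# Invented prompt containing literal braces that str.format would reject.
prompt = 'Return JSON like {"question": "..."} for:\n{input_content}'
filled = fill_prompt(prompt, "Why is the sky blue?")
print(filled)
```

Whether to raise or to return an error string, as the notebook does, is a design choice; raising makes a broken optimized prompt visible immediately.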

In [17]:
response = generation(INITIAL_GENERATION_PROMPT, input_content)

In [18]:
print(response)

<item>
<rationale>
Since the answer discusses training data and loss functions, readers might want to know about specific improvements in this area.
</rationale>
<question>
How could AI training data be modified to better generate hands?
</question>
</item>

<item>
<rationale>
The answer mentions deformities as a general problem, so readers would be interested in comparing hands to other challenging features.
</rationale>
<question>
What other body parts do AI image generators struggle with?
</question>
</item>

<item>
<rationale>
The technical explanation invites a question about practical solutions being developed.
</rationale>
<question>
Which AI models currently generate the most accurate hands?
</question>
</item>

<item>
<rationale>
The answer discusses loss functions, leading to curiosity about alternative approaches.
</rationale>
<question>
What alternative loss functions could improve hand generation?
</question>
</item>

<item>
<rationale>
Since the answer mentions this is an

# Win rate calculation

In [19]:
import concurrent.futures

@cache
def calculate_winrate(
    input_contents: tuple[str, ...],
    generation_prompt_one: str,
    generation_prompt_two: str,
    filename_suffix: str = "",
):
    one_win = 0
    two_win = 0

    one_wins_judgements = []
    one_loses_judgements = []
    two_wins_judgements = []
    two_loses_judgements = []
    one_tie_judgements = []
    two_tie_judgements = []
    
    def calculate_winrate_single(input_content, index):
        response_one = generation(
            generation_prompt=generation_prompt_one,
            input_content=input_content,
        )
        response_two = generation(
            generation_prompt=generation_prompt_two,
            input_content=input_content,
        )
        flipped = (index%2 == 1)
        if not flipped:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_one,
                response_two=response_two,
            )
        else:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_two,
                response_two=response_one,
            )
        return response_one, response_two, justification, judgement, flipped
        
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        results = executor.map(calculate_winrate_single, input_contents, list(range(len(input_contents))))
        results = list(results)    
    
    judgement_unflipped = []
    for _, _, justification, judgement, flipped in results:
        if not flipped:
            if judgement == "one":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "two":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
        else:
            if judgement == "two":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "one":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
    
    if filename_suffix:
        df_to_display = pd.DataFrame(
            {
                "input_content": [""] + list(input_contents),
                "response_one": [generation_prompt_one] + [response_one for response_one, _, _, _, _ in results],
                "response_two": [generation_prompt_two] + [response_two for _, response_two, _, _, _ in results],
                "judgement": [""] + judgement_unflipped,
                "justification": [""] + [justification for _, _, justification, _, _ in results],
            }
        )
        display_dataframe(df_to_display, filename_suffix=filename_suffix)
        
    return (
        (
            one_win / (one_win + two_win),
            one_wins_judgements,
            one_tie_judgements,
            one_loses_judgements,
        ),
        (
            two_win / (one_win + two_win),
            two_wins_judgements,
            two_tie_judgements,
            two_loses_judgements,
        ),
    )
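
The tallying logic above (undo the flip, then count a tie as half a win for each side) can be isolated from the API calls. The following is a compact sketch under that reading; `tally` is a hypothetical helper and the sample judgements are invented.

```python
def tally(judgements_with_flips):
    """Take (judgement, flipped) pairs; return unflipped labels and both win rates."""
    swap = {"one": "two", "two": "one", "tie": "tie"}
    one_win = two_win = 0.0
    unflipped = []
    for judgement, flipped in judgements_with_flips:
        # When the order was flipped, the judge's "one" means prompt two.
        label = swap[judgement] if flipped else judgement
        unflipped.append(label)
        if label == "one":
            one_win += 1
        elif label == "two":
            two_win += 1
        else:  # a tie counts as half a win for each prompt
            one_win += 0.5
            two_win += 0.5
    total = one_win + two_win  # zero only when there are no judgements
    return unflipped, one_win / total, two_win / total

labels, winrate_one, winrate_two = tally(
    [("one", False), ("one", True), ("tie", False), ("two", True)]
)
print(labels, winrate_one, winrate_two)
```

Keeping this part pure makes it easy to test without calling the model.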

In [20]:
import os
import html
from IPython.display import display, HTML

def display_dataframe(df: pd.DataFrame, filename_suffix=""):
    html_prefix = '''
    <meta charset="UTF-8">
    <style>
    table {
        border-collapse: collapse;
    }
    td, th {
        border: 1px solid black;
        padding: 5px;
        vertical-align: top;
    }
    td {
        white-space: pre-wrap;
        font-family: monospace;
    }
    </style>
    '''
    
    # Define a single style function that highlights response_one or response_two based on judgement
    def highlight_responses(row):
        styles = ['' for _ in row]
        if row['judgement'] == 'one':
            styles[row.index.get_loc('response_one')] = 'background-color: #90EE90'
        elif row['judgement'] == 'two':
            styles[row.index.get_loc('response_two')] = 'background-color: #90EE90'
        return styles

    os.makedirs("html_output", exist_ok=True)
    output_table_file_name = f"html_output/winrate_calculation{filename_suffix}.html"
    
    # Escape HTML in every cell, then turn newlines into <br> tags
    styled_df = (
        df.replace({r'\n': '__NEWLINE__'}, regex=True)
        .astype(str)
        .apply(lambda column: column.map(html.escape))
        .replace({'__NEWLINE__': '<br>'}, regex=True)
    )
    
    # Apply the style function
    styled_df = styled_df.style.apply(highlight_responses, axis=1)
    
    # Write the styled DataFrame to an HTML file.
    # Styler.to_html replaces Styler.render, which was removed in pandas 2.0.
    with open(output_table_file_name, 'w', encoding='utf-8') as f:
        f.write(html_prefix + styled_df.to_html())
    
    # Create a clickable link to the HTML file
    link = f'<a href="{output_table_file_name}" target="_blank">{output_table_file_name}</a>'
    display(HTML(link))

In [21]:
evaluation_one, evaluation_two = calculate_winrate(
    input_contents = INPUT_CONTENTS[:20],
    generation_prompt_one = INITIAL_GENERATION_PROMPT,
    generation_prompt_two = "Generate a question and return in <question> and </question>",
    filename_suffix = "-demo",
)

In [22]:
print(str(evaluation_one)[:500])

(0.9943820224719101, ["Let me analyze both sets:\n\nSet 1:\n- All questions are directly related to the source topic\n- Questions are clear and concise\n- No ambiguous references\n- Questions are distinct from each other\n- Questions address technical aspects that would interest someone reading about AI image generation\n- The questions logically follow from the explanation about loss functions and training\n- The author, being a Machine Learning Engineer, would likely be qualified to answer the


# Prompt optimization

In [23]:
evaluation_string_template = """
Winrate: {winrate}
Cases where the prompt won: {win_judgements}
Cases where the prompt ties: {tie_judgements}
Cases where the prompt loses: {lose_judgements}
"""

In [24]:
optimization_prompt_template = """
Improve the generation prompt according to the feedback

<current_generation_prompt>
{generation_prompt}
</current_generation_prompt>

<feedback>
{evaluation_string}
</feedback>

This is the judging criteria
<judging_criteria>
{judging_criteria}
</judging_criteria>

Summarize the changes that you intend to make, and return the new prompt between <prompt> and </prompt>.
""".strip()

In [25]:
def optimization(
    generation_prompt: str,
    evaluations: list,
) -> tuple[str, str]:
    evaluation_string = ""
    for evaluation in evaluations:
        winrate, win_judgements, tie_judgements, lose_judgements = evaluation
        evaluation_string_single = evaluation_string_template.format(
            winrate=winrate,
            win_judgements=win_judgements,
            tie_judgements=tie_judgements,
            lose_judgements=lose_judgements,
        )
        evaluation_string += evaluation_string_single

    optimization_prompt = optimization_prompt_template.format(
        generation_prompt=generation_prompt,
        evaluation_string=evaluation_string,
        judging_criteria=JUDGE_CLASSIFICATION_PROMPT,
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": optimization_prompt
            },
        ]
    )
    
    optimization_response = message.content[0].text
    optimized_prompt = extract_from_tags(optimization_response, tag_string="prompt")
    
    return optimization_response, optimized_prompt

In [26]:
import re
def extract_from_tags(text, tag_string="prompt"):
    pattern = f'<{tag_string}>(.*?)</{tag_string}>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return ''
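
`extract_from_tags` returns an empty string when no complete `<prompt>` block is found, which could happen if the optimizer's reply is cut off before `</prompt>`. A minimal sketch of a guard that keeps the current prompt in that case; `choose_next_prompt` is a hypothetical helper, and the sample replies are invented.

```python
import re

def extract_from_tags(text: str, tag_string: str = "prompt") -> str:
    # Same behaviour as the notebook's helper: first tagged block, or '' if absent.
    match = re.search(f"<{tag_string}>(.*?)</{tag_string}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def choose_next_prompt(current_prompt: str, optimization_response: str) -> str:
    """Hypothetical guard: fall back to the current prompt if the candidate is unusable."""
    candidate = extract_from_tags(optimization_response)
    if not candidate or "{input_content}" not in candidate:
        return current_prompt
    return candidate

current = "Write 5 followup questions.\n{input_content}"
good = "Plan...\n<prompt>\nWrite 5 distinct questions.\n{input_content}\n</prompt>"
truncated = "Plan...\n<prompt>\nWrite 5 distinct questions."  # closing tag never arrived

print(choose_next_prompt(current, good))
print(choose_next_prompt(current, truncated) == current)
```

The second check also rejects a candidate that dropped the `{input_content}` placeholder, since such a prompt could not receive the Quora answer.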

In [27]:
optimization_response, optimized_prompt = optimization(INITIAL_GENERATION_PROMPT, [evaluation_one])

In [28]:
print(optimization_response[:1000])

Based on the feedback, I notice these key patterns for successful prompts:

1. The winning sets consistently:
- Maintained topic relevance
- Had exactly 5 questions
- Asked distinct questions
- Used concise language without ambiguity
- Built naturally from the source content
- Asked for knowledge requiring expertise/experience
- Would interest the original audience
- Could be reasonably answered by the author

2. The losing sets typically:
- Were completely off-topic
- Had only 1 question instead of 5
- Asked about random topics unrelated to the source
- Could be easily googled

Changes I'll make to improve the prompt:

1. Add explicit requirements for topical relevance
2. Emphasize that questions should build upon the source content
3. Add clarity about avoiding easily googleable questions
4. Stress the importance of asking questions the author could reasonably answer
5. Maintain the XML structure for consistency
6. Keep the focus on practical, experience-based knowledge

Here's the i

In [29]:
print(optimized_prompt)

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:
- Questions must be directly related to the source content
- Questions should build naturally from information provided in the answer
- Questions should ask for knowledge that requires expertise or personal experience (not easily googleable)
- Questions should be ones that the author would be qualified to answer based on their demonstrated knowledge
- Questions should interest people who read the original answer
- Questions must be clear without requiring additional context
- Questions should use minimal necessary words
- Each question must be distinct from the others

Reply in this format:

<item>
<rationale>
Why this question follows from the source and what makes it valuable
</rationale>
<question>
The actual question
</question>
</item>

[Repeat for all 5 questions]

This is the question and answer:

<source_question_and_answer>
{input_content}
</source_question_and_answer>


# Iterative optimization

In [30]:
generation_prompts = [INITIAL_GENERATION_PROMPT, INITIAL_GENERATION_PROMPT]

In [31]:
import random
for latest_index in range(1, 4):
    generation_prompt_latest = generation_prompts[-1]
    evaluations = []
    for previous_index, generation_prompt_old in enumerate(generation_prompts[:-1]):
        random.seed(f"{previous_index} {latest_index}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        _, evaluation = calculate_winrate(
            input_contents,
            generation_prompt_old,
            generation_prompt_latest,
            filename_suffix = f"-{previous_index}-{latest_index}",
        )
        evaluations.append(evaluation)

    optimization_response, generation_prompt_new = optimization(generation_prompt_latest, evaluations)
    generation_prompts.append(generation_prompt_new)
    print(generation_prompt_new)
    print("---")

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:
- Questions must be specific and focused on unique aspects mentioned in the answer
- Questions should ask for personal experiences and insights not easily found online
- Questions should follow a logical progression and build upon each other
- Questions should focus on practical implications where appropriate
- Questions must be completely understandable without additional context
- Questions should not use more words than necessary
- Questions must be ones the author would be able to answer based on their demonstrated knowledge
- Questions must maintain thematic consistency with the original answer
- Each question should be distinct from the others

Reply in this format:

<item>
<rationale>
rationale
</rationale>
<question>
question
</question>
</item>
...

This is the question and answer:

<source_question_and_answer>
{input_content}
</source_question_and_answer>
---


This is a Quora answer. Write 5 followup questions to this answer.

Requirements:
- Questions must be specific and focused on unique aspects mentioned in the answer
- Questions should ask for insights and experiences that aren't easily found online
- Questions should follow a logical progression and build upon each other
- Questions should balance specific details with broader implications where appropriate
- Questions must be completely understandable without additional context
- Questions should be appropriately detailed while avoiding unnecessary words
- Questions must not make assumptions about the author's role or involvement
- Questions must align with the demonstrated knowledge in the answer
- Questions must maintain thematic consistency with the original answer
- Each question should be distinct from the others
- Questions should maintain an appropriate scope - neither too narrow nor too broad

Reply in this format:

<item>
<rationale>
rationale
</rationale>
<question>
question

This is a Quora answer. Write 5 followup questions to this answer.

Requirements:
- Questions must be maximally concise while maintaining clarity
- Questions must directly reference specific points or examples from the answer
- Questions should ask for personal experiences and practical insights
- Questions must be answerable based on the expertise demonstrated in the answer
- Questions should maintain a natural, conversational tone
- Questions must be completely understandable without additional context
- Questions must seek information that isn't easily found through online searches
- Questions should build naturally from the themes in the original answer
- Each question must be distinct and focused on a single aspect
- Questions should avoid theoretical or academic abstractions in favor of practical applications

Reply in this format:

<item>
<rationale>
rationale
</rationale>
<question>
question
</question>
</item>
...

This is the question and answer:

<source_question_and_answer>

# Display winrate table

In [32]:
import numpy as np
win_rate_matrix = [[np.nan for _ in generation_prompts] for _ in generation_prompts]

for one_idx, prompt_one in enumerate(generation_prompts):
    for two_idx, prompt_two in enumerate(generation_prompts[one_idx+1:], start=one_idx+1):
        random.seed(f"{one_idx} {two_idx}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        evaluation_one, evaluation_two = calculate_winrate(
            input_contents,
            prompt_one,
            prompt_two,
            filename_suffix = f"-{one_idx}-{two_idx}",
        )
        win_rate_one, _, _, _ = evaluation_one
        win_rate_two, _, _, _ = evaluation_two
        # Layout: win_rate_matrix[row][column] is the win rate of prompt `column` against prompt `row`
        win_rate_matrix[two_idx][one_idx] = win_rate_one
        win_rate_matrix[one_idx][two_idx] = win_rate_two

In [33]:
win_rate_matrix

[[nan, 0.4, 0.75, 0.6, 0.775],
 [0.6, nan, 0.7, 0.6, 0.65],
 [0.25, 0.3, nan, 0.1, 0.6],
 [0.4, 0.4, 0.9, nan, 0.65],
 [0.225, 0.35, 0.4, 0.35, nan]]
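
Given how the matrix is filled, each column holds one prompt's win rates against every other prompt. One rough way to summarize it is the column mean, sketched below with the values printed above. This summary is an assumption of mine rather than something the notebook computes, and each pair was judged on its own random sample, so the averages are noisy.

```python
import math

# Values copied from the printed win_rate_matrix above.
win_rate_matrix = [
    [math.nan, 0.4, 0.75, 0.6, 0.775],
    [0.6, math.nan, 0.7, 0.6, 0.65],
    [0.25, 0.3, math.nan, 0.1, 0.6],
    [0.4, 0.4, 0.9, math.nan, 0.65],
    [0.225, 0.35, 0.4, 0.35, math.nan],
]

def mean_win_rates(matrix):
    """Average each column, skipping the NaN diagonal."""
    n = len(matrix)
    means = []
    for col in range(n):
        values = [matrix[row][col] for row in range(n) if not math.isnan(matrix[row][col])]
        means.append(sum(values) / len(values))
    return means

means = mean_win_rates(win_rate_matrix)
for idx, rate in enumerate(means):
    print(f"Prompt {idx}: {rate:.3f}")
```

On these numbers Prompt 2 has the highest column mean and Prompt 4 the second highest, even though Prompt 4 wins their head-to-head comparison.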

In [34]:
import numpy as np
from IPython.display import HTML

num_prompts = len(generation_prompts)

# Start building the HTML table
html_table = "<table border='1' style='border-collapse: collapse;'>"

# Create the header row
html_table += "<tr><th></th>"
for j in range(num_prompts):
    html_table += f"<th>Prompt {j}</th>"
html_table += "</tr>"

for i in range(num_prompts):
    html_table += f"<tr><th>Prompt {i}</th>"
    for j in range(num_prompts):
        if i == j or np.isnan(win_rate_matrix[i][j]):
            html_table += "<td></td>"  # Empty cell for diagonal or undefined win rates
        else:
            win_rate = win_rate_matrix[i][j]
            cell_html = f'<a href="html_output/winrate_calculation-{min(i,j)}-{max(i,j)}.html" target="_blank" style="text-decoration: none;">{win_rate:.2f}</a>'
            html_table += f"<td>{cell_html}</td>"
    html_table += "</tr>"
html_table += "</table>"

display(HTML(html_table))

| | Prompt 0 | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 |
|---|---|---|---|---|---|
| Prompt 0 | | 0.4 | 0.75 | 0.6 | 0.78 |
| Prompt 1 | 0.6 | | 0.7 | 0.6 | 0.65 |
| Prompt 2 | 0.25 | 0.3 | | 0.1 | 0.6 |
| Prompt 3 | 0.4 | 0.4 | 0.9 | | 0.65 |
| Prompt 4 | 0.23 | 0.35 | 0.4 | 0.35 | |


# Conclusion
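Read each cell as the win rate of the column prompt against the row prompt. The final prompt (Prompt 4) wins more than half of its comparisons against every earlier prompt (0.78, 0.65, 0.6 and 0.65), according to the judge we assume we trust. The improvement is not strictly step by step, though: Prompt 3 wins only 0.1 against Prompt 2.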

In [35]:
print(input_content)

Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.


In [36]:
response = generation(generation_prompts[0], input_content)

In [37]:
print(response)

<item>
<rationale>
Asking about the technical details of the loss function mentioned would help readers understand how AI image quality is measured.
</rationale>
<question>
How exactly do loss functions measure image-caption relationships?
</question>
</item>

<item>
<rationale>
Since the answer mentions that deformities aren't penalized, readers would want to know if there are potential solutions.
</rationale>
<question>
What methods could penalize anatomical deformities in AI image generation?
</question>
</item>

<item>
<rationale>
The answer focuses on hands, but readers would wonder about other common AI image generation problems.
</rationale>
<question>
What other body parts do AI image generators struggle with?
</question>
</item>

<item>
<rationale>
Since training data is mentioned as crucial, readers would be interested in data quality impacts.
</rationale>
<question>
How does training data quality affect hand rendering?
</question>
</item>

<item>
<rationale>
The answer sugge

In [38]:
response = generation(generation_prompts[-1], input_content)

In [39]:
print(response)

<item>
<rationale>
Asks about practical experience with loss functions, focusing on how real implementations handle specific image quality issues.
</rationale>
<question>
What specific loss functions have you found most effective at catching anatomical errors in generated images?
</question>
</item>

<item>
<rationale>
Explores the practical challenges of implementing deformity detection in training, drawing on the respondent's hands-on experience.
</rationale>
<question>
How do you balance penalizing deformities against maintaining creative freedom in your image generation models?
</question>
</item>

<item>
<rationale>
Seeks insights about real-world solutions to the caption-image relationship problem mentioned in the answer.
</rationale>
<question>
What caption preprocessing techniques help improve anatomical accuracy in your generated images?
</question>
</item>

<item>
<rationale>
Builds on the answer's discussion of training objectives by asking about practical metrics for succes