# Convert your judge into a generation prompt

Given
- a certified judge classifier `JUDGE_CLASSIFICATION_PROMPT: str`
- some input string data samples `INPUT_CONTENTS: tuple[str, ...]`
- initial prompt `INITIAL_GENERATION_PROMPT: str`

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Foreword

If you can't verify whether an output is good or bad, you can't reliably make good generations.

But if you can verify, you should easily be able to make good generations, and this tool helps you write the prompt that produces them.


#### Inspirations

- Cohere prompt tuner https://cohere.com/blog/intro-prompt-tuner
- Anthropic workbench https://console.anthropic.com/workbench/
- Chatbot Arena leaderboard https://lmarena.ai/

## Example use case presented in this notebook

Given a Quora answer, write followup questions to the answer.

Great followup questions should be appealing to respond to and answers should be appealing to read.

The inputs for this use case
- `JUDGE_CLASSIFICATION_PROMPT: str` - prompt that classifies which of two sets of followup questions is better
- `INPUT_CONTENTS: tuple[str, ...]` - Quora answers
- `INITIAL_GENERATION_PROMPT: str` - prompt that generates followup questions

Produce
- a prompt with a good winrate (measured by the judge) against other prompts

## Frequently asked questions

Why don't you include `JUDGE_CLASSIFICATION_PROMPT` in the generation prompt?
- You can if you want. But the result still needs to perform better according to the judge.
- I expect future models to brainstorm several responses and then think carefully about which response is the best. However, doing so incurs extra cost and latency, and we might not want this tradeoff.
- The idea of prompt engineering for the generation prompt is to teach the model shortcuts on what good outputs are.

Why don't you tune the `JUDGE_CLASSIFICATION_PROMPT` as well?
- In this tool we assume that we trust `JUDGE_CLASSIFICATION_PROMPT`.
- It is important to get this right. You should tune the judge, but tune it elsewhere.
- In practice we do want to provide feedback to point out where the judge is obviously wrong. I leave this to the roadmap.

Why is the `JUDGE_CLASSIFICATION_PROMPT` doing comparison of two generation outputs instead of classifying whether one is good?
- For some answers it is easy to generate good followup questions but tricky to generate great followup questions.
- For other answers it may be difficult even to generate followup questions that are not outright bad.
- I want the judge to be useful in both cases.

# TODO

- Print more intermediate results to html
- Improve the optimization prompt
- Make the optimization prompt write all of system prompt, user prompt and assistant prefill

In [1]:
from functools import cache

import anthropic
client = anthropic.Anthropic()

# Inputs for the tool user

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("content.csv")

In [3]:
df.head(2)

Unnamed: 0,content
0,"Why do AI image generators have so much trouble rendering hands?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nThe current AI image generators are not rewarded for drawing hands well.\nThe training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).\nL(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).\nHowever, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.\nIf you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question."
1,"At what point do you consider an LLM to have an ability to plan?\n\nTong Hui Kang\nMachine Learning Engineer in Recommendations\nConsider the following query and response\nUser:\nHow many r are there in strawberry? Check your answer\nAssistant:\nThere are 2 'r' letters in the word ""strawberry"".\nTo check this answer, let's break down the word:\ns-t-r-a-w-b-e-r-r-y\nWe can see that there are indeed two instances of the letter 'r' in ""strawberry"":\n1. The first 'r' comes after 't'\n2. The second 'r' is the second-to-last letter\nThis confirms that the answer of 2 'r' letters is correct.\n\nhttps://poe.com/s/X6Rkh9wDJ50PSc...\nThe LLM has planned to BS their justification when they print the number “2”.\nBad planning is still planning."


In [4]:
INPUT_CONTENTS: tuple[str, ...] = tuple(df["content"])

In [5]:
INITIAL_GENERATION_PROMPT = """
You are given the following source question and a source answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

Write 5 followup questions to the answer.

Reply in this format 

<questions>

<question>
question
</question>

...

</questions>
""".strip()

In [6]:
JUDGE_CLASSIFICATION_PROMPT = """
You are given an answer and two sets of followup questions.

Determine which set of followup questions is better.

These are required
- Each question should appear between <question> and </question>
- There should be 5 questions

The followup questions are considered bad if they have any of these characteristics
- There are ambiguous references in the question that cannot be understood without context
- The question uses more words than necessary

If both sets of questions have neither of the bad characteristics, the better set of followup questions should have these characteristics
- Each question should be distinct
- The questions ask for knowledge that is not easily found online
- The author is likely to answer the questions
- People interested in the source question and answer will also be interested in answers to the followup questions

This is the source question and source answer
<source_question_and_answer>
{input_content}
</source_question_and_answer>

This is the first set of followup questions

{response_one}

This is the second set of followup questions

{response_two}

Write some reasoning, then end your response with one of the following
- Set <label>one</label> is better.
- Set <label>two</label> is better.
- Both sets <label>tie</label>.
""".strip()

# Judge classification prompt

In [7]:
# We assume that we trust this prompt
judge_classification_prompt = JUDGE_CLASSIFICATION_PROMPT

def judge_classification(
    input_content: str,
    response_one: str,
    response_two: str,
) -> tuple[str, str]:
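    # Returns (justification, label), where label is "one", "two" or "tie".
    # A response without a recognizable label is treated as a tie.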
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": judge_classification_prompt.format(
                    input_content=input_content,
                    response_one=response_one,
                    response_two=response_two,
                )
            },
        ],
    )
    message_text = message.content[0].text
    if "<label>one</label>" in message_text:
        return message_text, "one"
    if "<label>two</label>" in message_text:
        return message_text, "two"
    return message_text, "tie"
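
The substring checks above test for `<label>one</label>` first, so a justification that mentions that label in passing but concludes with `<label>two</label>` would be read as "one". A minimal sketch of a stricter parser, assuming the judge follows the instruction to end with its verdict; `parse_final_label` is a hypothetical helper and not part of this notebook.

```python
import re

# Matches the three verdicts that the judge prompt asks for.
LABEL_PATTERN = re.compile(r"<label>(one|two|tie)</label>")


def parse_final_label(message_text: str) -> str:
    """Return the last labelled verdict in the text, or "tie" if there is none."""
    labels = LABEL_PATTERN.findall(message_text)
    return labels[-1] if labels else "tie"


print(parse_final_label(
    "Set <label>one</label> looks weaker. Set <label>two</label> is better."
))
```

Falling back to "tie" keeps the same default as `judge_classification`.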

In [8]:
input_content = """
Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.
""".strip()

response_one = """
<question>What are some AI image generation models?</question>

<question>Where can I find the training set for the image models?</question>

<question>What are some ways to improve the quality of AI-generated images?</question>

<question>What is the role of captions in AI image generation?</question>

<question>What techniques have researchers tried so far to address the hand rendering problem?</question>
"""

response_two = """
<question>How do we reward AI image generators to draw hands correctly?</question>

<question>Why are hands particularly challenging compared to other anatomical features?</question>

<question>What preprocessing techniques have worked best for curating training datasets that lead to better hand renderings?</question>

<question>What specific loss functions have you found most effective for detecting and penalizing anatomical deformities in AI-generated images?</question>

<question>How do different AI image generation models (like Stable Diffusion, DALL-E, Midjourney) compare in their ability to render hands?</question>
"""

In [9]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_one,
    response_two=response_two,
)

In [10]:
judgement

'two'

In [11]:
print(justification)

Let me analyze both sets:

Set One:
- Questions are clear and concise
- Questions are fairly general and basic
- Most answers can be easily found online
- Some questions (like "What are some AI image generation models?") don't directly build on the specific topic of hand rendering issues
- Questions feel less focused on the core problem discussed in the answer

Set Two:
- Questions are specifically focused on the hand rendering problem
- Questions build directly on the concepts mentioned in the answer (loss functions, rewards, training)
- Questions require expert knowledge and experience to answer
- Questions are likely to generate interesting technical discussions
- Each question approaches the problem from a different angle (rewards, anatomical challenges, preprocessing, loss functions, comparative analysis)
- The person who answered the original question would likely have valuable insights to share on these topics

Set two demonstrates better characteristics because:
1. The question

In [12]:
justification, judgement = judge_classification(
    input_content=input_content,
    response_one=response_two,  # swapped
    response_two=response_one,  # swapped
)

In [13]:
judgement

'one'

In [14]:
print(justification)

Let me analyze both sets:

Set One:
- Questions are specific and focused on the hand rendering problem
- Questions build directly on the core topic
- Questions seek expert knowledge that would be valuable to the community
- Questions are distinct from each other
- Questions are well-formed without ambiguous references
- The author, being a Machine Learning Engineer, would likely be able to provide valuable insights on these technical questions

Set Two:
- Questions are more general and broad
- Some questions like "What are some AI image generation models?" are easily answerable through a quick internet search
- "Where can I find the training set" and "What are some ways to improve quality" are too broad and not specifically related to the hand rendering problem
- Questions don't take full advantage of the author's expertise
- While clear, the questions don't maintain strong relevance to the specific problem discussed in the answer

Reasoning:
1. Set One maintains focus on the specific 
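
The two runs above agree after swapping the order, which suggests the judge is not just favoring a position for this example. A sketch of how that check could be automated, assuming a judge callable with the same signature as `judge_classification`; `judge_both_orders` and the stand-in judge `longer_wins` are hypothetical names used only for illustration.

```python
from typing import Callable

# Same shape as judge_classification: returns (justification, label).
Judge = Callable[[str, str, str], tuple[str, str]]


def judge_both_orders(judge: Judge, input_content: str, response_a: str, response_b: str) -> str:
    """Return "a", "b" or "tie"; a win only counts if both orderings agree."""
    _, first = judge(input_content, response_a, response_b)
    _, second = judge(input_content, response_b, response_a)
    # In the second call the positions are swapped, so "two" refers to response_a.
    if first == "one" and second == "two":
        return "a"
    if first == "two" and second == "one":
        return "b"
    return "tie"


def longer_wins(input_content: str, response_one: str, response_two: str) -> tuple[str, str]:
    # Stand-in judge for illustration only: prefers the longer response.
    if len(response_one) > len(response_two):
        return "stub", "one"
    if len(response_two) > len(response_one):
        return "stub", "two"
    return "stub", "tie"


print(judge_both_orders(longer_wins, "", "a longer response", "short"))
```

This doubles the number of judge calls per comparison. The win rate calculation in this notebook takes the cheaper route of alternating the order by sample index.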

# Generation prompt

In [15]:
def generation(
    generation_prompt: str,
    input_content: str,
) -> str:
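    try:
        generation_prompt_with_inputs = generation_prompt.format(
            input_content=input_content,
        )
    except (KeyError, IndexError, ValueError):
        # str.format fails on unknown placeholders or unbalanced braces.
        # The message is returned in place of a model response.
        return "{input_content} should appear in the prompt"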

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": generation_prompt_with_inputs
            },
        ],
        stop_sequences=["</questions>"],
    )
    message_text = message.content[0].text
    return message_text

In [16]:
response = generation(INITIAL_GENERATION_PROMPT, input_content)

In [17]:
print(response)

<questions>

<question>
How could we modify the loss function L(caption, image) to specifically penalize hand deformities in generated images?
</question>

<question>
Are there specific types of captions or prompts that tend to produce better hand renderings in current AI image generators?
</question>

<question>
Why do hands specifically seem to be more challenging for AI to render compared to other body parts or objects?
</question>

<question>
Could transfer learning from models specifically trained on hand images improve the overall quality of hand generation?
</question>

<question>
How do different AI image generators (like DALL-E, Midjourney, Stable Diffusion) compare in their ability to render hands accurately?
</question>
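
The response stops before `</questions>` because of the stop sequence, and the judge requires five questions between `<question>` and `</question>`. A sketch of checking that format locally before spending a judge call; `extract_questions` and `meets_format` are hypothetical helpers, and the sample text is made up for illustration.

```python
import re


def extract_questions(response: str) -> list[str]:
    """Return the stripped text of every <question>...</question> pair."""
    matches = re.findall(r"<question>(.*?)</question>", response, re.DOTALL)
    return [match.strip() for match in matches]


def meets_format(response: str, expected_count: int = 5) -> bool:
    """True if the response has exactly expected_count non-empty questions."""
    questions = extract_questions(response)
    return len(questions) == expected_count and all(questions)


sample = """
<questions>

<question>
Why are hands harder to render than faces?
</question>

<question>
Which loss terms penalize deformities?
</question>
"""
print(extract_questions(sample))
print(meets_format(sample))
```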




# Win rate calculation

In [18]:
import concurrent.futures

@cache
def calculate_winrate(
    input_contents: tuple[str, ...],
    generation_prompt_one: str,
    generation_prompt_two: str,
    filename_suffix: str = "",
):
    one_win = 0 
    two_win = 0

    one_wins_judgements = []
    one_loses_judgements = []
    two_wins_judgements = []
    two_loses_judgements = []
    one_tie_judgements = []
    two_tie_judgements = []
    
    def calculate_winrate_single(input_content, index):
        response_one = generation(
            generation_prompt=generation_prompt_one,
            input_content=input_content,
        )
        response_two = generation(
            generation_prompt=generation_prompt_two,
            input_content=input_content,
        )
        flipped = (index%2 == 1)
        if not flipped:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_one,
                response_two=response_two,
            )
        else:
            justification, judgement = judge_classification(
                input_content=input_content,
                response_one=response_two,
                response_two=response_one,
            )
        return response_one, response_two, justification, judgement, flipped
        
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        results = executor.map(calculate_winrate_single, input_contents, list(range(len(input_contents))))
        results = list(results)    
    
    judgement_unflipped = []
    for _, _, justification, judgement, flipped in results:
        if not flipped:
            if judgement == "one":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "two":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
        else:
            if judgement == "two":
                one_win += 1
                judgement_unflipped.append("one")
                one_wins_judgements.append(justification)
                two_loses_judgements.append(justification)
            elif judgement == "one":
                two_win += 1
                judgement_unflipped.append("two")
                two_wins_judgements.append(justification)
                one_loses_judgements.append(justification)
            else:
                one_win += 1/2
                two_win += 1/2
                judgement_unflipped.append("tie")
                one_tie_judgements.append(justification)
                two_tie_judgements.append(justification)
    
    if filename_suffix:
        df_to_display = pd.DataFrame(
            {
                "input_content": [""] + list(input_contents),
                "response_one": [generation_prompt_one] + [response_one for response_one, _, _, _, _ in results],
                "response_two": [generation_prompt_two] + [response_two for _, response_two, _, _, _ in results],
                "judgement": [""] + judgement_unflipped,
                "justification": [""] + [justification for _, _, justification, _, _ in results],
            }
        )
        display_dataframe(df_to_display, filename_suffix=filename_suffix)
        
    return (
        (
            one_win / (one_win + two_win),
            one_wins_judgements,
            one_tie_judgements,
            one_loses_judgements,
        ),
        (
            two_win / (one_win + two_win),
            two_wins_judgements,
            two_tie_judgements,
            two_loses_judgements,
        ),
    )
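
The flipped and unflipped branches above do the same bookkeeping once the label is mapped back to the original order. A compact sketch of that mapping and of the scoring rule (a win counts 1, a tie counts 1/2); `unflip` and `winrate_of_one` are hypothetical names, and this version does not collect the justifications.

```python
def unflip(judgement: str, flipped: bool) -> str:
    """Map a judge label back to the caller's ordering of the two prompts."""
    if not flipped:
        return judgement
    return {"one": "two", "two": "one"}.get(judgement, "tie")


def winrate_of_one(results: list[tuple[str, bool]]) -> float:
    """Win rate of prompt one over (judgement, flipped) pairs; a tie is half a win."""
    points = {"one": 1.0, "two": 0.0}
    total = sum(points.get(unflip(judgement, flipped), 0.5) for judgement, flipped in results)
    return total / len(results)


# Unflipped labels: one, two, tie, one.
print(winrate_of_one([("one", False), ("one", True), ("tie", False), ("two", True)]))
```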

In [19]:
import os
import html
from IPython.display import display, HTML

def display_dataframe(df: pd.DataFrame, filename_suffix=""):
    html_prefix = '''
    <meta charset="UTF-8">
    <style>
    table {
        border-collapse: collapse;
    }
    td, th {
        border: 1px solid black;
        padding: 5px;
        vertical-align: top;
    }
    td {
        white-space: pre-wrap;
        font-family: monospace;
    }
    </style>
    '''
    
    # Define a single style function that highlights response_one or response_two based on judgement
    def highlight_responses(row):
        styles = ['' for _ in row]
        if row['judgement'] == 'one':
            styles[row.index.get_loc('response_one')] = 'background-color: #90EE90'
        elif row['judgement'] == 'two':
            styles[row.index.get_loc('response_two')] = 'background-color: #90EE90'
        return styles

    os.makedirs("html_output", exist_ok=True)
    output_table_file_name = f"html_output/winrate_calculation{filename_suffix}.html"
    
    # Escape HTML first, then turn newlines into <br> so they survive in the table cells
    styled_df = (
        df.astype(str)
        .apply(lambda column: column.map(html.escape))
        .replace({r'\n': '<br>'}, regex=True)
    )
    
    # Apply the style function and hide the index column
    styled_df = styled_df.style.apply(highlight_responses, axis=1).hide(axis="index")
    
    # Write the styled DataFrame to an HTML file.
    # Styler.render() was removed in pandas 2.0; to_html() and hide() need pandas >= 1.4.
    with open(output_table_file_name, 'w', encoding='utf-8') as f:
        f.write(html_prefix + styled_df.to_html())
    
    # Create a clickable link to the HTML file
    link = f'<a href="{output_table_file_name}" target="_blank">{output_table_file_name}</a>'
    display(HTML(link))

In [20]:
evaluation_one, evaluation_two = calculate_winrate(
    input_contents = INPUT_CONTENTS,
    generation_prompt_one = INITIAL_GENERATION_PROMPT,
    generation_prompt_two = INITIAL_GENERATION_PROMPT,
    filename_suffix = "-demo",
)

In [21]:
print(str(evaluation_one)[:500])

(0.5056179775280899, ["Let me analyze both sets:\n\nFirst set:\n- Questions are clear and specific\n- No ambiguous references\n- Questions are concise\n- Good mix of technical and practical questions\n- Questions build naturally on the source answer\n- The question about human artists adds an interesting cross-disciplinary perspective\n- Asks about specific differences between generators, which is practical\n\nSecond set:\n- Questions are clear and specific\n- No ambiguous references\n- Question


# Prompt optimization

In [22]:
evaluation_string_template = """
Winrate: {winrate}
Cases where the prompt won: {win_judgements}
Cases where the prompt tied: {tie_judgements}
Cases where the prompt lost: {lose_judgements}
"""

In [23]:
optimization_prompt_template = """
Improve the generation prompt according to the feedback

<current_generation_prompt>
{generation_prompt}
</current_generation_prompt>

<feedback>
{evaluation_string}
</feedback>

This is the judging criteria
<judging_criteria>
{judging_criteria}
</judging_criteria>

Summarize the changes that you intend to make, and return the new prompt between <prompt> and </prompt>.
""".strip()

In [24]:
def optimization(
    generation_prompt: str,
    evaluations: list, 
) -> tuple[str, str]:
    evaluation_string = ""
    for evaluation in evaluations:
        winrate, win_judgements, tie_judgements, lose_judgements = evaluation
        evaluation_string_single = evaluation_string_template.format(
            winrate=winrate,
            win_judgements=win_judgements,
            tie_judgements=tie_judgements,
            lose_judgements=lose_judgements,
        )
        evaluation_string += evaluation_string_single

    optimization_prompt = optimization_prompt_template.format(
        generation_prompt=generation_prompt,
        evaluation_string=evaluation_string,
        judging_criteria=JUDGE_CLASSIFICATION_PROMPT,
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user", 
                "content": optimization_prompt
            },
        ]
    )
    
    optimization_response = message.content[0].text
    optimized_prompt = extract_from_tags(optimization_response, tag_string="prompt")
    
    return optimization_response, optimized_prompt
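
The feedback above is built by formatting Python lists directly into the template, so the optimizer sees list reprs with escaped newlines and no length limit. One possible way to present the judgements more readably, as a sketch only; `format_judgements`, the `<judgement>` tag and the two limits are assumptions of mine, not something this notebook uses.

```python
def format_judgements(judgements: list[str], max_items: int = 3, max_chars: int = 500) -> str:
    """Render judgements as tagged blocks, capped in count and length."""
    if not judgements:
        return "(none)"
    blocks = []
    for number, text in enumerate(judgements[:max_items], start=1):
        snippet = text if len(text) <= max_chars else text[:max_chars] + " [truncated]"
        blocks.append(f'<judgement index="{number}">\n{snippet}\n</judgement>')
    omitted = len(judgements) - max_items
    if omitted > 0:
        blocks.append(f"({omitted} more omitted)")
    return "\n".join(blocks)


print(format_judgements(["Set one is clearer.", "Set two is too verbose."], max_items=1))
```

Capping the feedback trades information for a shorter optimization prompt, so the limits are worth tuning against the win rate.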

In [25]:
import re
def extract_from_tags(text, tag_string="prompt"):
    pattern = f'<{tag_string}>(.*?)</{tag_string}>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return ''
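
`extract_from_tags` returns an empty string when no `<prompt>` tags are found, and an optimized prompt can also come back with extra placeholders or stray braces that break `str.format`. A sketch of validating the prompt before it enters the next round, using the standard library's `string.Formatter`; `placeholder_names` and `is_valid_generation_prompt` are hypothetical helpers.

```python
from string import Formatter


def placeholder_names(prompt: str) -> set[str]:
    """Return the replacement field names that str.format would look up."""
    return {field for _, field, _, _ in Formatter().parse(prompt) if field is not None}


def is_valid_generation_prompt(prompt: str) -> bool:
    """True if {input_content} is the only placeholder and the braces are balanced."""
    try:
        return placeholder_names(prompt) == {"input_content"}
    except ValueError:
        # Unbalanced braces such as a lone "{" make the template unparseable.
        return False


print(is_valid_generation_prompt("Answer:\n{input_content}\nWrite 5 followup questions."))
print(is_valid_generation_prompt(""))
```

A prompt that fails this check could be discarded in favor of the previous prompt instead of being scored.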

In [26]:
optimization_response, optimized_prompt = optimization(INITIAL_GENERATION_PROMPT, [evaluation_one])

In [27]:
print(optimization_response[:1000])

Based on the feedback analysis, here are the key improvements needed for the generation prompt:

1. Better specify quality criteria in the prompt:
- Questions should be distinct from each other
- Questions should ask for knowledge not easily found through online searches
- Questions should be likely to get responses from the author based on their demonstrated expertise
- Questions should maintain relevance to anyone interested in the original topic

2. Explicitly state avoidance criteria:
- No ambiguous references that require additional context
- No unnecessary words or verbose phrasing

3. Add guidance for question progression and flow:
- Questions should build naturally from the source material
- Questions should follow a logical progression
- Questions should maintain connection to key points in the original answer

4. Emphasize balance between:
- Technical and practical aspects
- Specific details and broader implications
- Personal experience and general knowledge

Here's the impr

In [28]:
print(optimized_prompt)

You are given the following source question and a source answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

Write 5 followup questions to the answer. The questions should:
- Be distinct from each other without overlap
- Ask for knowledge that isn't easily found through online searches
- Be likely to get responses based on the author's demonstrated expertise
- Be relevant to people interested in the original topic
- Build naturally from the source material
- Follow a logical progression
- Maintain clear connection to key points in the original answer
- Balance technical and practical aspects where appropriate

Avoid:
- Ambiguous references that require additional context
- Unnecessary words or verbose phrasing

Reply in this format:

<questions>

<question>
[Your question here]
</question>

[Repeat for all 5 questions]

</questions>


# Iterative optimization

In [29]:
generation_prompts = [INITIAL_GENERATION_PROMPT, INITIAL_GENERATION_PROMPT]

In [30]:
import random
for latest_index in range(1, 4):
    generation_prompt_latest = generation_prompts[-1]
    evaluations = []
    for previous_index, generation_prompt_old in enumerate(generation_prompts[:-1]):
        random.seed(f"{previous_index} {latest_index}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        _, evaluation = calculate_winrate(
            input_contents,
            generation_prompt_old,
            generation_prompt_latest,
            filename_suffix = f"-{previous_index}-{latest_index}",
        )
        evaluations.append(evaluation)

    optimization_response, generation_prompt_new = optimization(generation_prompt_latest, evaluations)
    generation_prompts.append(generation_prompt_new)
    print(generation_prompt_new)
    print("---")

You are given the following source question and a source answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

Write 5 followup questions that:
- Flow naturally from the source answer
- Are direct and concise (avoid phrases like "Can you elaborate" or "Could you provide")
- Focus on distinct aspects with no overlap between questions
- Ask for insights not easily found through online searches
- Can be answered based on the author's demonstrated knowledge
- Balance technical and practical aspects when applicable
- Provide value to people interested in the original topic

Reply in this format:

<questions>

<question>
question
</question>

...

</questions>
---


You are given the following source question and a source answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

Write 5 followup questions that:
- Flow naturally from specific points in the source answer
- Use simple, clear language without unnecessary words
- Focus on practical rather than theoretical aspects
- Ask about specific experiences and observations
- Target topics where the author has demonstrated knowledge
- Ask for insights not easily found through online searches
- Would interest readers of the original answer
- Are distinct with no overlap between questions

Reply in this format:

<questions>

<question>
question
</question>

...

</questions>
---


You are given the following source question and a source answer.

<source_question_and_answer>
{input_content}
</source_question_and_answer>

Write 5 followup questions that:
- Build directly from specific points mentioned in the source answer
- Ask for personal experiences, examples and observations where appropriate
- Focus on practical implementation rather than theory
- Seek insights not easily found through online searches
- Target topics where the author has demonstrated knowledge/experience
- Use concise language while maintaining clarity
- Are distinct with no overlap between questions
- Avoid assumptions not supported by the source material

Reply in this format:

<questions>

<question>
question
</question>

...

</questions>
---


# Display winrate table

In [31]:
import numpy as np
win_rate_matrix = [[np.nan for _ in generation_prompts] for _ in generation_prompts]

for one_idx, prompt_one in enumerate(generation_prompts):
    for two_idx, prompt_two in enumerate(generation_prompts[one_idx+1:], start=one_idx+1):
        random.seed(f"{one_idx} {two_idx}")
        input_contents = tuple(random.sample(INPUT_CONTENTS, min(20, len(INPUT_CONTENTS))))
        evaluation_one, evaluation_two = calculate_winrate(
            input_contents,
            prompt_one,
            prompt_two,
            filename_suffix = f"-{one_idx}-{two_idx}",
        )
        win_rate_one, _, _, _ = evaluation_one
        win_rate_two, _, _, _ = evaluation_two
        win_rate_matrix[one_idx][two_idx] = win_rate_one
        win_rate_matrix[two_idx][one_idx] = win_rate_two

In [32]:
win_rate_matrix

[[nan, 0.45, 0.45, 0.2, 0.425],
 [0.55, nan, 0.25, 0.25, 0.4],
 [0.55, 0.75, nan, 0.3, 0.4],
 [0.8, 0.75, 0.7, nan, 0.5],
 [0.575, 0.6, 0.6, 0.5, nan]]
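
One simple way to rank the prompts from this matrix is the mean win rate of each row, ignoring the undefined diagonal. The sketch below copies the values printed above; averaging rows is my own choice of summary rather than something the notebook prescribes, and `mean_win_rates` is a hypothetical helper.

```python
import math

nan = math.nan
# Values copied from the win_rate_matrix output above.
win_rate_matrix = [
    [nan, 0.45, 0.45, 0.2, 0.425],
    [0.55, nan, 0.25, 0.25, 0.4],
    [0.55, 0.75, nan, 0.3, 0.4],
    [0.8, 0.75, 0.7, nan, 0.5],
    [0.575, 0.6, 0.6, 0.5, nan],
]


def mean_win_rates(matrix: list[list[float]]) -> list[float]:
    """Average each row over the opponents that were actually played."""
    means = []
    for row in matrix:
        played = [value for value in row if not math.isnan(value)]
        means.append(sum(played) / len(played))
    return means


means = mean_win_rates(win_rate_matrix)
best_index = max(range(len(means)), key=means.__getitem__)
print(means)
print(f"Prompt {best_index} has the highest mean win rate")
```

By this measure Prompt 3 comes out ahead of Prompt 4, so the latest prompt is not automatically the one to keep.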

In [33]:
import numpy as np
from IPython.display import HTML

num_prompts = len(generation_prompts)

# Start building the HTML table
html_table = "<table border='1' style='border-collapse: collapse;'>"

# Create the header row
html_table += "<tr><th></th>"
for j in range(num_prompts):
    html_table += f"<th>Prompt {j}</th>"
html_table += "</tr>"

for i in range(num_prompts):
    html_table += f"<tr><th>Prompt {i}</th>"
    for j in range(num_prompts):
        if i == j or np.isnan(win_rate_matrix[i][j]):
            html_table += "<td></td>"  # Empty cell for diagonal or undefined win rates
        else:
            win_rate = win_rate_matrix[i][j]
            cell_html = f'<a href="html_output/winrate_calculation-{min(i,j)}-{max(i,j)}.html" target="_blank" style="text-decoration: none;">{win_rate:.2f}</a>'
            html_table += f"<td>{cell_html}</td>"
    html_table += "</tr>"
html_table += "</table>"

display(HTML(html_table))

| | Prompt 0 | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 |
|---|---|---|---|---|---|
| Prompt 0 | | 0.45 | 0.45 | 0.2 | 0.42 |
| Prompt 1 | 0.55 | | 0.25 | 0.25 | 0.4 |
| Prompt 2 | 0.55 | 0.75 | | 0.3 | 0.4 |
| Prompt 3 | 0.8 | 0.75 | 0.7 | | 0.5 |
| Prompt 4 | 0.57 | 0.6 | 0.6 | 0.5 | |
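
Each cell is estimated from at most 20 sampled inputs, so small gaps between win rates may be noise. A rough sketch of the standard error under a binomial approximation; this ignores ties, which count as half a win here, so treat the number as a ballpark only. `winrate_standard_error` is a hypothetical helper.

```python
import math


def winrate_standard_error(winrate: float, num_samples: int) -> float:
    """Binomial standard error of a win rate estimated from num_samples comparisons."""
    return math.sqrt(winrate * (1 - winrate) / num_samples)


se = winrate_standard_error(0.5, 20)
print(f"{se:.3f}")  # 0.112
```

With about 0.11 of standard error around 0.5, a 0.55 versus 0.45 split is well within one standard error, while 0.8 versus 0.2 is not.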


# Conclusion
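The prompts tend to get better with each iteration: every later prompt wins at least half of its comparisons against every earlier prompt, according to the judge we assume we trust. Prompt 0 and Prompt 1 are both the initial prompt, so the 0.45 versus 0.55 split between them shows how much noise to expect from 20 samples.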

In [34]:
print(input_content)

Why do AI image generators have so much trouble rendering hands?

Tong Hui Kang
Machine Learning Engineer in Recommendations
The current AI image generators are not rewarded for drawing hands well.
The training input for image models is a huge set of images and their captions. The objective here is to fit a function f(caption) -> image, by minimizing some loss function L(caption, image).
L(caption, image) is high when the image is unrelated to the caption, and L(caption, image) is low when the image is very related to the caption. The image generation model is trained to minimize L(caption, image).
However, L(caption, image) might be high for images with deformities, because L(caption, image) might only measure whether the image is related to the caption.
If you don’t penalize deformities in the training, you can’t generate images free-of-deformities. As of my current understanding, generating images without deformities is still an open question.


In [35]:
response = generation(generation_prompts[0], input_content)

In [36]:
print(response)

<questions>

<question>
How could we modify the loss function L(caption, image) to specifically penalize hand deformities in generated images?
</question>

<question>
Are there any specific architectural changes in AI models that could help improve the rendering of hands, or is it purely a training data/loss function issue?
</question>

<question>
Do other complex anatomical features (like faces or feet) face similar rendering challenges, or is this problem unique to hands?
</question>

<question>
Could incorporating 3D anatomical models or skeletal structure information into the training process help improve hand rendering?
</question>

<question>
How do human artists learn to draw hands correctly, and could these principles be somehow incorporated into AI training methods?
</question>




In [37]:
response = generation(generation_prompts[-1], input_content)

In [38]:
print(response)

<questions>

<question>
Could you share specific examples of loss functions you've implemented or encountered that effectively measure image-caption relationships while avoiding deformity issues?
</question>

<question>
In your experience working with image generation models, what modifications to training objectives have shown the most promise in reducing hand deformities specifically?
</question>

<question>
How do you practically evaluate and measure deformities in generated images during the training process, beyond just caption relevance?
</question>

<question>
Based on your machine learning background, what approaches for penalizing anatomical irregularities have you seen attempted in real-world implementations?
</question>

<question>
When working with recommendation systems, have you found any parallels between detecting poor quality content and identifying deformities in generated images that could be applied here?
</question>


