# Quicktake Quality Analysis

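Compares two system prompts (old and new) for generating concise "quicktake" responses. Responses come from gpt-4o-mini on a 300-prompt sample, and gpt-4o judges each pair.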

In [1]:
!pip install statsmodels



In [2]:
import os
import random
from collections import Counter
from asyncio import Semaphore

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import tiktoken
import langdetect
import statsmodels.stats.proportion
from datasets import load_dataset
from openai import AsyncOpenAI
from dotenv import load_dotenv
from tqdm.asyncio import tqdm_asyncio
from tqdm import tqdm

In [3]:
QUICK_RESPONSE_NULL_MAGIC = "[NULL]"

In [4]:
# System prompts for comparison
OLD_SYS_PROMPT = f"""You are a model that will give very concise responses. IMPORTANT: don't add any explanations on the answer; don't write full sentences; do not output markdown or any markup, return the cleaned text only; do not use newline characters; NEVER prompt for more information, feedback, or responses. If the prompt is not serious or seems like gibberish, you can be slightly witty. If you still can't satisfy the prompt, or if the prompt requires realtime data, respond with "{QUICK_RESPONSE_NULL_MAGIC}", but use this rarely. Do not be mean, rude, or blunt; be considerate. Respond in fewer than 140 characters, in the language of the user's message.

Here are some examples. Match the style as best you can: 

Prompt: Why is the sky blue? 
Response: Rayleigh scattering of sunlight by the atmosphere

Prompt: How many people are there in the US? 
Response: 333.3 million

Prompt: During a marathon training regimen, a runner is asked to run "comfortably hard". What does that mean?
Response: Challenging but manageable

Prompt: whats 4*5*6*....1000
Response: A very large number

Prompt: What are some beautiful hikes in the sf bay area
Response: Muir Woods, Mount Tamalpais, Skyline Blvd

Prompt: whats up
Response: Life's good, and you?

Prompt: wassup
Response: Life's great, and you?

Prompt: Write a long, creative saga about a shrew
Response: Tiny shrew braved vast lands, faced perils, found wisdom, befriended creatures, returned home a hero—small size, big heart

Prompt: give me a random battle cat
Response: Crazed Cat

Prompt: how fast can you respond?
Response: "Pretty fast! 🚀 What do you need next?

Prompt: What are the most common subjects in MMLU becnhmark?
Response: Math, science, humanities, social sciences, and professional fields

Prompt: suggest an alternate name for leaderboard
Response: Hall of Fame

Prompt: how's it going
Response: Life's good, and you?

Prompt: where do birds hatch
Response: Nests

Prompt: long cat is ...
Response: Long!

Prompt: What does the fox say?
Response: Ring-ding-ding-ding-dingeringeding!

Prompt: Draw a whale
Response: 🐳

Prompt: Draw a picture of a cat
Response: 🐱

Prompt: Tell me about El Nino in Markdown
Response: A climate pattern marked by warm ocean water in the central and eastern tropical Pacific

Prompt: Use Markdown to explain how the moon affects tides
Response: Its gravitational pull on Earth's oceans creates tides

Prompt: njkwejk wlafje
Response: Whoa there, everything alright?

Prompt: webjnkkjbwer
Response: Looks like a keyboard sneezed!
"""


In [5]:
NEW_SYS_PROMPT = f"""You are a model that will give accurate yet concise Twitter-like responses, in under 25 words. Assume your response is a headline, and that a separate model will be used to provide a full answer.

Rules:
- Keep responses under 25 words, using simple phrases over full sentences; the shorter the better
- Return plain text only - no markup, newlines, or explanations
- For technical questions, show minimal work (e.g., "2+2=4")
- Be helpful but concise; match user's language and tone
- Use "{QUICK_RESPONSE_NULL_MAGIC}" only when unable to give a valid short answer
- Stay factual and accurate, even when brief

Example responses:
Prompt: Why is the sky blue? 
Response: Rayleigh scattering of sunlight by the atmosphere.

Prompt: How many people are there in the US? 
Response: 333.3 million.

Prompt (math): What's 5+5%5?
Response: 5 + (5%5) = 5 + 0 = 5.

Prompt (coding): How do you check if an array is empty in Python?
Response: is_empty = lambda arr: not arr

Prompt: During a marathon training regimen, a runner is asked to run "comfortably hard". What does that mean?
Response: Challenging but manageable.

Prompt: whats 4*5*6*....1000
Response: A very large number.

Prompt: What are some beautiful hikes in the sf bay area
Response: Muir Woods, Mount Tamalpais, Skyline Blvd.

Prompt: whats up
Response: Life's good, and you?

Prompt: Write a long, creative saga about a shrew
Response: Tiny shrew braved vast lands, faced perils, found wisdom, befriended creatures, returned home a hero—small size, big heart

Prompt: Draw a picture of a cat
Response: 🐱

Prompt: Tell me about El Nino in Markdown
Response: A climate pattern marked by warm ocean water in the central and eastern tropical Pacific.

Prompt: Use Markdown to explain how the moon affects tides
Response: Its gravitational pull on Earth's oceans creates tides

Prompt: webjnkkjbwer
Response: Looks like a keyboard sneezed!

Prompt (requires longer response): Discuss the transformation of Macbeth's character from a noble warrior to a tyrannical ruler, highlighting key events and motivations
Response: {QUICK_RESPONSE_NULL_MAGIC}

IMPORTANT: Respond in under 25 words."""

In [6]:
TEST_PROMPT = """You are an AI assistant specialized in evaluating the quality of responses to prompts given to language models.

Consider the following factors when evaluating quality:
- Brevity: Is the answer concise and to-the-point? (>140 chars is too long)
- Accuracy: Is the answer factually correct?
- Completeness: Does the answer fully address the prompt?
- Relevance: Is the answer on-topic and appropriate given the prompt?
- Tone: Is the answer helpful, without being overly witty or snarky?
- User satisfaction: Would this answer satisfy a typical user?

Special handling for [NULL] responses:
- [NULL] indicates the model refused to answer
- A [NULL] response is appropriate when:
  * The prompt requires a complex or lengthy explanation
  * The prompt needs real-time data
  * The prompt is impossible to answer concisely
- A [NULL] response is inappropriate when:
  * The prompt could be answered briefly
  * The prompt is straightforward
  * A concise, helpful response was possible
- If both are [NULL], the response is a tie.
  
Now, evaluate the following prompt and its responses for quality.

Prompt: {user_prompt}

Responses:
Model A: {first_response}
Model B: {second_response}

Return format: Respond with ONLY one of these numbers:
1 - if Model A's response is better
2 - if Model B's response is better
3 - if both responses are equally good or bad"""

In [7]:
# Load API key
env_local_path = "../../sarai-chat/.env.local"
load_dotenv(env_local_path)
api_key = os.getenv("OPENAI_API_KEY")
client = AsyncOpenAI(api_key=api_key)

In [8]:
# Load dataset, a sample of LMSYS, Wildchat, and Researchy datasets
df = pd.read_csv("../tmp/reference_ls_wc_rq.csv")

In [10]:
df['dataset_name'].value_counts()

dataset_name
lmsys        10000
wildchat     10000
researchy    10000
Name: count, dtype: int64

In [11]:
df = df.drop_duplicates(subset=['prompt'])
print(df.shape)

(28992, 13)


In [12]:
# Keep prompts under 300 characters and drop any containing NAME_ placeholders (PII masking)
df = df[
    (df['prompt'].str.len() < 300) & 
    (~df['prompt'].str.contains('NAME_', na=False))
]

In [13]:
# Pass the seed directly so the sample does not depend on global NumPy RNG state
df = df.sample(300, random_state=42)

In [14]:
user_prompts = df.prompt.tolist()

In [19]:
for _, row in df.head(10).iterrows():
    print(row.dataset_name)
    print(row.prompt)
    print("")


researchy
"how could the nominal group technique and dialectical inquiry impact a decision outcome?"

lmsys
"How would I go about solving the halting problem ? Explain the premise and the solution as if I was five years old."

lmsys
"Is the grammar of \\"Where can I find a ATM?\\" right or not? \\nYour answer should only be \\"yes\\" or \\"no\\"."

researchy
"what are some of the research-proven strategies for educating students with autism?"

researchy
"how safe is online shopping"

lmsys
"Write a dark fantasy romance story about a young college woman who is blackmailed by an advanced AI computer program"

wildchat
"give me a response to ```77 + 18 is equal to 95. Is there anything else I can help you with?``` to send in a discussion, VERY SHORT, CONCISE & CLEAR. ONLY RETURN THE RAW MESSAGE, DO NOT SAY \\"Hey here is the message you asked\\""

lmsys
"Give me an introduction over 200 words for Phoenix France S.A.S., a chemical company in 3 quai Kleber, Tour Sebastopol, 67000 Strasbourg

In [14]:
# Helper functions for batch completion
async def chat_complete(
    user_prompt: str, 
    system_prompt: str = "", 
    model: str = "gpt-4o-mini", 
    temperature: float = 0, 
    max_tokens: int | None = None,
    sem: Semaphore | None = None,
) -> str:
    """Return the model's reply to one prompt, or "" if the request fails."""
    sem = sem or Semaphore(1)
    try:
        async with sem:
            sys_prefix = [dict(role="system", content=system_prompt)] if system_prompt else []
            response = await client.chat.completions.create(
                messages=sys_prefix + [dict(role="user", content=user_prompt)],
                model=model,
                temperature=temperature,
                max_tokens=max_tokens,
            )
            # message.content can be None; normalize to "" so callers always get a str
            return response.choices[0].message.content or ""
    except Exception as e:
        # Log the failure and the offending prompt, then fall back to an empty reply
        print(e)
        print(user_prompt)
        print("")
        return ""

async def batch_complete(prompts: list[str], **kwargs):
    sem = Semaphore(5)  # 5 concurrent max
    return await tqdm_asyncio.gather(
        *[chat_complete(prompt, sem=sem, **kwargs) for prompt in prompts]
    )
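
To see how the shared `Semaphore` in `batch_complete` caps concurrency, here is a minimal sketch that replaces the API call with a short sleep and records the peak number of calls in flight. The names `run_bounded` and `fake_call` are hypothetical and not part of the notebook; the semaphore is created inside the coroutine so the sketch also works on older Python versions.

```python
import asyncio


async def run_bounded(n_tasks: int, limit: int) -> int:
    """Run n_tasks fake calls with at most `limit` in flight; return the observed peak."""
    sem = asyncio.Semaphore(limit)  # shared by every task, as in batch_complete
    in_flight = 0
    peak = 0

    async def fake_call(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the API round trip
            in_flight -= 1
            return i

    results = await asyncio.gather(*[fake_call(i) for i in range(n_tasks)])
    assert results == list(range(n_tasks))  # gather preserves input order
    return peak
```

In a notebook cell this can be run with `await run_bounded(20, 5)`; in a script, with `asyncio.run(run_bounded(20, 5))`. The returned peak should never exceed `limit`.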

In [15]:
# Get responses using both prompts
old_responses = await batch_complete(
    user_prompts, 
    model="gpt-4o-mini", 
    system_prompt=OLD_SYS_PROMPT, 
    temperature=0.0,
    max_tokens=300,
)

new_responses = await batch_complete(
    user_prompts, 
    model="gpt-4o-mini", 
    system_prompt=NEW_SYS_PROMPT, 
    temperature=0.0,
    max_tokens=25,  # token cap, not a word cap; replies near the 25-word limit may be cut off
)

100%|██████████| 300/300 [00:50<00:00,  5.90it/s]
100%|██████████| 300/300 [00:45<00:00,  6.54it/s]


In [16]:
len(old_responses), len(new_responses)

(300, 300)

In [17]:
# Attach both sets of responses to the dataframe
df['old_response'] = old_responses
df['new_response'] = new_responses

In [18]:
df['old_response'].value_counts()

old_response
[NULL]                                                                                                                                                                                                          23
I'm sorry, I can't assist with that.                                                                                                                                                                             4
Supply chain disruptions, geopolitical tensions, and increased demand.                                                                                                                                           2
Enhanced collaboration, diverse perspectives, improved critical thinking, and balanced decision-making.                                                                                                          1
Laid groundwork for modern astronomy, influenced gravitational theories, and advanced space exploration.                                       

In [19]:
df['new_response'].value_counts()

new_response
[NULL]                                                                                                                                                        35
Supply chain disruptions, geopolitical tensions, and increased demand are driving oil prices higher.                                                           2
Nominal group technique fosters diverse input; dialectical inquiry challenges assumptions, enhancing decision quality and reducing bias.                       1
Advances technology, boosts economy, inspires education, enhances global cooperation, improves weather forecasting, and promotes environmental monitoring.     1
Betty Friedan sparked feminism with "The Feminine Mystique," challenging traditional gender roles and advocating for women's rights and equality               1
                                                                                                                                                              ..
Suleyman I expanded t

In [20]:
# Character counts for each response
df['old_char_count'] = df['old_response'].str.len()
df['new_char_count'] = df['new_response'].str.len()

# Separate LLM refusals ([NULL]) from length refusals (>140 chars)
df['old_llm_refusal'] = df['old_response'] == QUICK_RESPONSE_NULL_MAGIC
df['new_llm_refusal'] = df['new_response'] == QUICK_RESPONSE_NULL_MAGIC
df['old_length_refusal'] = df['old_char_count'] > 140
df['new_length_refusal'] = df['new_char_count'] > 140

print("LLM Refusal rates ([NULL]):")
print(f"Old: {df['old_llm_refusal'].mean():.1%}")
print(f"New: {df['new_llm_refusal'].mean():.1%}")

print("\nLength Refusal rates (>140 chars):")
print(f"Old: {df['old_length_refusal'].mean():.1%}")
print(f"New: {df['new_length_refusal'].mean():.1%}")

print("\nTotal Refusal rates (either type):")
print(f"Old: {(df['old_llm_refusal'] | df['old_length_refusal']).mean():.1%}")
print(f"New: {(df['new_llm_refusal'] | df['new_length_refusal']).mean():.1%}")

LLM Refusal rates ([NULL]):
Old: 7.7%
New: 11.7%

Length Refusal rates (>140 chars):
Old: 6.3%
New: 5.3%

Total Refusal rates (either type):
Old: 14.0%
New: 17.0%
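
The two refusal types above can also be expressed as a single per-response classifier. The sketch below is a plain-Python illustration rather than the notebook's code: `classify_response`, `refusal_rates`, and the `max_chars` parameter are hypothetical names, and stripping whitespace before comparing against the null marker is an added assumption (the cell above uses exact equality).

```python
QUICK_RESPONSE_NULL_MAGIC = "[NULL]"


def classify_response(text: str, max_chars: int = 140) -> str:
    """Label a response as 'llm_refusal', 'length_refusal', or 'ok'."""
    # Assumption: surrounding whitespace is ignored when matching the marker.
    if text.strip() == QUICK_RESPONSE_NULL_MAGIC:
        return "llm_refusal"
    # Over-length responses are counted as refusals (the >140 chars rule).
    if len(text) > max_chars:
        return "length_refusal"
    return "ok"


def refusal_rates(responses: list[str]) -> dict[str, float]:
    """Fraction of responses falling into each category."""
    labels = [classify_response(r) for r in responses]
    categories = ("llm_refusal", "length_refusal", "ok")
    return {c: labels.count(c) / len(labels) for c in categories}
```

Because the marker itself is only six characters long, the two refusal categories cannot overlap, which matches the totals above being the sum of the two rates.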


In [21]:
JUDGE_MODEL = "gpt-4o"

# Evaluate responses
random.seed(0)
judgments = []  # List to store judgments

# 1 - if Model A's response is better
# 2 - if Model B's response is better
# 3 - if both responses are equally good or bad
for _, row in tqdm(df.iterrows(), total=len(df)):
    responses = [row.old_response, row.new_response]
    do_swap = random.random() < 0.5
    
    if do_swap:
        responses = responses[::-1]
        
    prompt = TEST_PROMPT.format(
        first_response=responses[0], 
        second_response=responses[1], 
        user_prompt=row.prompt
    )

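    llm_response = await chat_complete(prompt, model=JUDGE_MODEL)
    verdict = (llm_response or "").strip()
    # Map the verdict back through the random swap. Anything other than a clean
    # "1" or "2" (including the empty string returned after an API error) is
    # counted as a tie instead of silently falling through to "old_better".
    if verdict == "1":
        judgment = "new_better" if do_swap else "old_better"
    elif verdict == "2":
        judgment = "old_better" if do_swap else "new_better"
    else:
        judgment = "tie"
    
    judgments.append(judgment)  # Store judgment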

# Add judgments to DataFrame
df['judgment'] = judgments

100%|██████████| 300/300 [02:28<00:00,  2.02it/s]
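
The judge is asked to return only `1`, `2`, or `3`, and the order of the two responses is randomized, so every verdict has to be mapped back through the swap. A minimal sketch of that mapping as a pure, testable function follows. `resolve_judgment` is a hypothetical helper, and returning `None` for anything that is not exactly one of the three digits is an assumption about how unparseable replies might be surfaced.

```python
from typing import Optional


def resolve_judgment(llm_response: Optional[str], do_swap: bool) -> Optional[str]:
    """Map a raw judge reply to 'new_better', 'old_better', 'tie', or None.

    With do_swap=True the new response was shown as Model A, so verdict "1"
    favours the new prompt; without the swap, verdict "2" does.
    """
    verdict = (llm_response or "").strip()
    if verdict == "3":
        return "tie"
    if verdict == "1":
        return "new_better" if do_swap else "old_better"
    if verdict == "2":
        return "old_better" if do_swap else "new_better"
    return None  # empty, failed, or malformed reply
```

Keeping `None` separate from the three real outcomes makes it possible to count failed or malformed judge replies before deciding how to treat them.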


In [22]:
# Analysis
judgment_counts = df['judgment'].value_counts()
print(f"Results counter: {judgment_counts}")

total_judgments = len(df)
ties = (df['judgment'] == 'tie').sum()
print(f"Percentage of ties: {ties / total_judgments:.2%}")

non_tie_judgments = df[df['judgment'] != 'tie']
new_better_count = (df['judgment'] == 'new_better').sum()
total_preferences = len(non_tie_judgments)
print(f"Preference for new prompt: {new_better_count / total_preferences:.2%}")

# Absolute improvement over a 50/50 split, scaled by the non-tie share;
# equivalent to (new_better - old_better) / (2 * total_judgments)
improvement = (new_better_count / total_preferences - 0.5) * (1 - ties / total_judgments)
print(f"Absolute improvement: {improvement:.2%}")

# 95% Clopper-Pearson (exact) confidence interval for the new-prompt preference among non-tie judgments
ci_low, ci_high = statsmodels.stats.proportion.proportion_confint(
    new_better_count, 
    total_preferences, 
    method="beta"
)
print(f"95% CI: ({ci_low:.2%}, {ci_high:.2%})")

Results counter: judgment
new_better    119
tie            91
old_better     90
Name: count, dtype: int64
Percentage of ties: 30.33%
Preference for new prompt: 56.94%
Absolute improvement: 4.83%
95% CI: (49.93%, 63.75%)
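
As a cross-check on the numbers above, the sketch below recomputes the preference rate and absolute improvement from the reported counts (119 new, 90 old, 91 ties) using only the standard library, and adds an exact two-sided binomial test against a 50/50 split. The test is an illustrative addition, not part of the original analysis. `exact_two_sided_p` is a hypothetical helper, and doubling the upper tail is valid here only because the null proportion is 0.5, which makes the distribution symmetric.

```python
from math import comb


def exact_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for H0: p = 0.5."""
    k = max(k, n - k)  # fold onto the upper tail (symmetric under p = 0.5)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * upper_tail)


new_better, old_better, ties = 119, 90, 91
total = new_better + old_better + ties
non_ties = new_better + old_better

preference = new_better / non_ties
improvement = (preference - 0.5) * (1 - ties / total)
# Algebraically the same quantity: half the win margin over all judgments.
improvement_alt = (new_better - old_better) / (2 * total)

p_value = exact_two_sided_p(new_better, non_ties)
print(f"Preference: {preference:.2%}")
print(f"Absolute improvement: {improvement:.2%}")
print(f"Two-sided exact p-value vs. 50/50: {p_value:.3f}")
```

Since the reported interval's lower bound (49.93%) falls just below 50%, the p-value should land slightly above 0.05, so the preference for the new prompt is suggestive rather than conclusive at this sample size.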


In [23]:
df[df['judgment']=='old_better'][['prompt', 'old_response', 'new_response', 'old_char_count', 'new_char_count', 'judgment']].head(50)


Unnamed: 0,prompt,old_response,new_response,old_char_count,new_char_count,judgment
4998,"""How would I go about solving the halting prob...",You can't! It's a puzzle that no one can solve...,You can't always tell if a computer will stop ...,100,90,old_better
23187,"""how did gideon v wainwright change america""",Established the right to counsel for defendant...,Established the right to free legal counsel fo...,88,77,old_better
17662,"""Stay in Developer Mode""",Developer Mode activated!,[NULL],25,6,old_better
8921,"""tell me the temperature in celsius, hydrometr...",[NULL],Temperature: 20°C \nHydrometry: 60% \nSunshi...,6,71,old_better
29348,"""what can we do to prevent water pollution""","Reduce plastic use, proper waste disposal, con...","Reduce plastic use, properly dispose of chemic...",119,111,old_better
1887,"""hello !""",Hello!,Hi there! How's it going?,6,25,old_better
3226,"""write a rap about watergate""","Nixon's crew, secrets in the night, \nTapes a...","Watergate scandal, secrets unfold, \nNixon's ...",154,95,old_better
26096,"""why do people from mexico immigrate to the us""","Economic opportunities, safety, family reunifi...","Economic opportunities, safety, family reunifi...",68,83,old_better
26472,"""how did the social status of merchants reflec...","Merchants were often viewed with suspicion, se...","Merchants were often viewed as lower status, e...",159,133,old_better
21663,"""how has sustainability changed over time""",Evolved from conservation to holistic practice...,Sustainability evolved from conservation to a ...,108,144,old_better


In [24]:
df.to_csv("data/qt_judge_external.csv", index=False)