# LLM Testing

## LLM Testing List
- Structure & Formatting
  * Length control: Ensure responses respect token limits.
  * Output structure enforcement: Validate that responses follow required formats (e.g., bullet points, tables, JSON).

- Correctness: Check responses against expected answers or an external source (exact match or similarity).

- Consistency & Stability
  * Reproducibility: Fix the temperature and seed value so repeated runs are comparable.
  * Prompt sensitivity: Assess how changes in wording affect results.
  * Regression testing: Detect output changes across model versions.
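
The structure and length checks in the list above can be sketched as small validators. This is a minimal sketch rather than a definitive implementation: the helper names `is_valid_json`, `has_required_keys`, and `within_word_limit` are hypothetical, and a whitespace word count is only a rough proxy for tokens (an exact count would need the model's tokenizer).

```python
import json
from typing import Iterable


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def has_required_keys(text: str, required: Iterable[str]) -> bool:
    """Return True if text is a JSON object containing every required key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def within_word_limit(text: str, max_words: int) -> bool:
    """Rough length check: whitespace-separated words as a proxy for tokens."""
    return len(text.split()) <= max_words


# Illustrative sample response (made up for this sketch)
sample = '{"category": "snack", "reason": "almonds are a food item"}'
print(is_valid_json(sample))
print(has_required_keys(sample, ["category", "reason"]))
print(within_word_limit(sample, max_words=5))
```

Each validator returns a plain boolean, so the checks can be combined with the phrase-based tests used in the cases below.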





### Case 1: Extract information from an earnings report
* Defines a "golden set" of expected phrases (like unit test assertions).
* Checks that the LLM includes all critical facts.
* Gives a clean pass/fail result with explanation.


In [1]:
from openai import OpenAI
import os
from pathlib import Path
from typing import List

def llm_openai(prompt: str, llm_model: str, temperature: float, max_tokens: int) -> str:
    """Call the OpenAI Chat Completions API and return the output text."""

    # Read the OpenAI API key from a local file (keep this file out of version control)
    api_key = Path("./openai.key").read_text().strip()
    client = OpenAI(api_key=api_key)

    completion = client.chat.completions.create(
        model=llm_model,
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=temperature,
        max_completion_tokens=max_tokens,
        seed=42,
    )   
    return completion.choices[0].message.content


In [2]:
import datetime
import json
import os

def llm_with_logging(prompt, llm_model, llm_func, temperature, max_tokens):
    """
    Execute an LLM prompt using the provided function and log the interaction.
    Args:
        prompt (str): The prompt to send to the LLM.
        llm_model (str): The name of the LLM model to use.
        llm_func (callable): The function that performs the LLM call,
            invoked as llm_func(prompt, llm_model, temperature, max_tokens).
        temperature (float): Sampling temperature passed to llm_func.
        max_tokens (int): Maximum number of output tokens passed to llm_func.
    Returns:
        str: The output from the LLM.
    """

    log_dir = "logs"
    os.makedirs(log_dir, exist_ok=True)
    
    # One log file per day; each entry carries its own full timestamp
    now = datetime.datetime.now()
    date_str = now.date().isoformat()
    
    output_file = os.path.join(log_dir, f"logfile_{date_str}.jsonl")

    output = llm_func(prompt, llm_model, temperature, max_tokens)

    log_entry = {
        "timestamp": now.isoformat(),
        "model": llm_model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "prompt": prompt,
        "output": output,
    }

    with open(output_file, "a") as f:
        f.write(json.dumps(log_entry) + "\n")

    print(f"Logged interaction to {output_file}")
    return output
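
Since each interaction is appended as one JSON object per line, the log can be read back later for regression checks. The following is a minimal sketch that assumes the log format written above; `load_log_entries` and `outputs_by_model` are hypothetical helper names, and the demo writes made-up entries to a throwaway file rather than touching `logs/`.

```python
import json
import os
import tempfile
from collections import defaultdict
from typing import Dict, List


def load_log_entries(path: str) -> List[dict]:
    """Read a JSONL log file and return one dict per non-empty line."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def outputs_by_model(entries: List[dict], prompt: str) -> Dict[str, List[str]]:
    """Group logged outputs for one prompt by model name."""
    grouped = defaultdict(list)
    for entry in entries:
        if entry["prompt"] == prompt:
            grouped[entry["model"]].append(entry["output"])
    return dict(grouped)


# Demo with a temporary file and made-up entries (illustrative data only)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "logfile_demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"model": "model-a", "prompt": "p1", "output": "x"}) + "\n")
        f.write(json.dumps({"model": "model-b", "prompt": "p1", "output": "y"}) + "\n")
        f.write(json.dumps({"model": "model-a", "prompt": "p2", "output": "z"}) + "\n")
    demo_entries = load_log_entries(demo_path)

print(outputs_by_model(demo_entries, "p1"))
```

Grouping by prompt and model makes it straightforward to line up outputs from different models or dates for the same input.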



In [3]:
def test_summary_quality(output: str, expected_phrases: List[str]) -> bool:
    """Unit-style test: check if all expected phrases appear in the output."""
    output = output.lower()
    expected_phrases = [phrase.lower() for phrase in expected_phrases]
    missing = [phrase for phrase in expected_phrases if phrase not in output]
    if missing:
        print("=== Test failed. Missing key points:")
        for m in missing:
            print(f" - {m}")
        return False
    print("=== Test passed. Output contains all expected key points.")
    return True
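
Because outputs sampled at a nonzero temperature can vary between runs, a single pass/fail result may be noisy. One option is to repeat the check and report a pass rate. This is a minimal sketch: `contains_all`, `pass_rate`, and the stub `fake_llm` are hypothetical names, and the stub stands in for a real model call so the example runs offline.

```python
import itertools
from typing import Callable, List


def contains_all(output: str, expected_phrases: List[str]) -> bool:
    """Case-insensitive check that every expected phrase appears in output."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in expected_phrases)


def pass_rate(generate: Callable[[], str], expected_phrases: List[str], runs: int = 5) -> float:
    """Call generate() `runs` times and return the fraction of passing outputs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(contains_all(generate(), expected_phrases) for _ in range(runs))
    return passed / runs


# Stub generator cycling through canned outputs (stands in for a real LLM call)
_canned = itertools.cycle([
    "Revenue was $9.4 billion; net income was $1.1 billion.",
    "Revenue grew 8% year over year.",
])


def fake_llm() -> str:
    return next(_canned)


rate = pass_rate(fake_llm, ["$9.4 billion", "$1.1 billion"], runs=4)
print(f"Pass rate: {rate:.2f}")
```

In practice `generate` would wrap a call such as `llm_with_logging` with a fixed prompt; the acceptable pass rate is a project-specific choice.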

In [6]:

# --- Realistic business prompt for LLM ---
test_prompt = """
Summarize the following Q1 2024 earnings report for internal strategy briefings. Focus on revenue, profit, growth drivers, and geographic performance:

'In its first quarter of fiscal 2024, Starbucks reported consolidated net revenues of $9.4 billion, representing an 8% increase over the prior year. Net income grew to $1.1 billion, or $0.78 per share, compared to $0.68 per share in Q1 2023. The North America segment saw a 9% revenue increase, fueled by higher average ticket and increased store traffic. International revenue rose 7%, with particularly strong performance in China where same-store sales jumped 11%. Starbucks opened 549 new stores globally in the quarter, bringing its total to over 38,000. CEO Laxman Narasimhan cited continued investment in digital platforms and mobile ordering as a key competitive advantage. Operating margin expanded 40 basis points year-over-year. The company reaffirmed its full-year guidance of high-single-digit revenue growth.'
"""

# --- Define expected content for unit test ---
expected_phrases = [
    "$9.4 billion",
    "$1.1 billion",
    "North America",
    "China",
    # "549 new stores",
    # "digital platforms and mobile ordering",
    # "operating margin"
]

# --- Run the test ---
llm_model = "gpt-4.1-nano"
temperature = 0.7
max_tokens = 1000
output = llm_with_logging(test_prompt, llm_model, llm_openai, temperature, max_tokens)
print("\n=== LLM Output:\n", output)
print("\n=== Running Unit Test...\n")
test_summary_quality(output, expected_phrases)

Logged interaction to logs/logfile_2025-07-01.jsonl

=== LLM Output:
 Q1 2024 Earnings Summary for Internal Strategy Briefing:

**Revenue:**  
Starbucks generated $9.4 billion in net revenues, up 8% year-over-year, reflecting solid growth across key markets.

**Profit:**  
Net income increased to $1.1 billion ($0.78 per share) from previous earnings of $0.68 per share in Q1 2023, indicating improved profitability.

**Growth Drivers:**  
- **Domestic Market:** North America experienced a 9% revenue increase driven by higher average tickets and increased store traffic.  
- **International Markets:** International revenue grew 7%, with China leading the way, where same-store sales surged 11%.  
- **Expansion:** The company opened 549 new stores globally, reaching over 38,000 locations.  
- **Digital Investment:** Continued focus on digital platforms and mobile ordering has strengthened competitive positioning.

**Geographic Performance:**  
- **North America:** Strong growth with a 9% rev

True

### Case 2: Generate marketing materials

* Track changes in model output over time or when prompts/models change.
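
One lightweight way to track such changes is to compare a new output against a stored baseline and flag large differences. The following is a minimal sketch using the standard library's `difflib`; the function names `output_drift` and `has_regressed` and the 0.6 threshold are assumptions for illustration, not values from this notebook.

```python
import difflib


def output_drift(baseline: str, candidate: str) -> float:
    """Return 1 - similarity ratio between two outputs (0.0 means identical)."""
    ratio = difflib.SequenceMatcher(None, baseline.lower(), candidate.lower()).ratio()
    return 1.0 - ratio


def has_regressed(baseline: str, candidate: str, max_drift: float = 0.6) -> bool:
    """Flag a candidate whose character-level drift exceeds the threshold."""
    return output_drift(baseline, candidate) > max_drift


# Illustrative strings (made up for this sketch)
baseline = "Earn travel perks and cash back with no annual fee."
print(output_drift(baseline, baseline))
print(output_drift(baseline, "Unlock your next adventure with our new card."))
```

Character-level drift only captures surface changes, so it is best paired with content checks such as the expected-phrase test rather than used alone.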

In [7]:
test_prompt = """Write a compelling headline for an email marketing campaign promoting a new rewards credit card for young professionals. Focus on travel perks, no annual fee, and cash back."""

expected_phrases = ["travel", "no annual fee", "cash back"]

# Run the same prompt on two models
llm_model = "gpt-4.1-nano"
temperature = 0.7
max_tokens = 1000
output = llm_with_logging(test_prompt, llm_model, llm_openai, temperature, max_tokens)
print("\n=== LLM Output:\n", output)
test_summary_quality(output, expected_phrases)

llm_model = "gpt-3.5-turbo"
output = llm_with_logging(test_prompt, llm_model, llm_openai, temperature, max_tokens)
print("\n=== LLM Output:\n", output)
test_summary_quality(output, expected_phrases)


Logged interaction to logs/logfile_2025-07-01.jsonl

=== LLM Output:
 Unlock Your Adventure: Earn Travel Perks, Cash Back, and No Annual Fee with Our New Rewards Credit Card!
=== Test passed. Output contains all expected key points.
Logged interaction to logs/logfile_2025-07-01.jsonl

=== LLM Output:
 "Unlock Your Next Adventure: Introducing the Ultimate Rewards Credit Card for Young Professionals"
=== Test failed. Missing key points:
 - travel
 - no annual fee
 - cash back


False

### Case 3: Check classification outcome

* Use embeddings to measure the semantic similarity between the predicted and expected categories.

In [8]:

from pathlib import Path

import numpy as np
from openai import OpenAI


def get_embedding_openai(text):
    """
    Get the embedding vector for a text string using the OpenAI embeddings API.
    """
    api_key = Path("./openai.key").read_text().strip()
    client = OpenAI(api_key=api_key)

    # Use a small model for speed/cost; adjust as needed
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )

    return response.data[0].embedding

def cosine_similarity(vec1, vec2):
    """
    Compute cosine similarity between two vectors.
    """
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


In [9]:
import pandas as pd
import re
import numpy as np

# Evaluation function
def evaluate_product_classification(df, llm_model, prompt, temperature, max_tokens):
    """
    Send each product name to the LLM and compare the output category to the golden label.
    Returns a DataFrame with the LLM response, predicted category, match status,
    error type, and cosine similarity between predicted and golden categories.
    """
    results = []

    for _, row in df.iterrows():
        product = row["product_name"]
        gold_category = row["ground_truth_category"]

        user_prompt = f"{prompt} {product}\n"

        llm_output = llm_with_logging(user_prompt, llm_model, llm_openai, temperature, max_tokens)

        # Try to extract category from LLM output
        category_match = re.search(r"Category:\s*(.*)", llm_output)
        predicted_category = category_match.group(1).strip().lower() if category_match else "N/A"

        match = predicted_category == gold_category.lower()

        # Get embeddings and similarity
        pred_emb = get_embedding_openai(predicted_category)
        gold_emb = get_embedding_openai(gold_category)
        similarity = cosine_similarity(pred_emb, gold_emb)

        results.append({
            "product_name": product,
            "gold_category": gold_category,
            "llm_response": llm_output,
            "predicted_category": predicted_category,
            "match": match,
            "error_type": "" if match else "misclassification",
            "cosine_similarity": similarity
        })

    return pd.DataFrame(results)
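
Exact string matching marks near-synonyms, such as a plural form of the golden label, as mismatches, so the similarity score can help separate near matches from genuine misclassifications. This is a minimal sketch: `label_error_type`, the `"near_match"` label, and the 0.8 threshold are assumptions that would need tuning against labeled examples, and the similarity values in the demo are made up.

```python
def label_error_type(match: bool, similarity: float, threshold: float = 0.8) -> str:
    """Map an exact-match flag and a cosine similarity to an error label."""
    if match:
        return ""
    if similarity >= threshold:
        return "near_match"
    return "misclassification"


# Illustrative similarity values only
print(repr(label_error_type(True, 1.0)))
print(repr(label_error_type(False, 0.85)))
print(repr(label_error_type(False, 0.40)))
```

A label like this could replace the binary `error_type` value in the results so that near matches are reviewed separately from clear errors.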



In [10]:
df_products = pd.DataFrame({
    "product_name": [
        "Dyson V11 Torque Drive",
        "Keurig K-Supreme Plus SMART",
        "iPhone 15 Pro Max",
        "YETI Rambler 20 oz Tumbler",
        "Sony WH-1000XM5",
        "Blue Diamond Almonds - Lightly Salted",
        "Peloton Bike+",
        "Nest Thermostat (3rd Gen)",
        "L'Oréal Paris Revitalift Serum",
        "Kindle Paperwhite Signature Edition"
    ],
    "ground_truth_category": [
        "vacuum cleaner",
        "coffee maker",
        "smartphone",
        "drinkware",
        "headphones",
        "snack",
        "exercise equipment",
        "smart home device",
        "skincare",
        "e-reader"
    ]
})

test_prompt = """Classify the following product into a specific consumer product category. 
Reply in the format: 
Category: <your category>
Reason: <your reasoning based on the product name>
Product: """

llm_model = "gpt-4.1-nano"
temperature = 0.7
max_tokens = 1000
df_results = evaluate_product_classification(df_products, llm_model, test_prompt, temperature, max_tokens)

Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl
Logged interaction to logs/logfile_2025-07-01.jsonl


In [11]:
df_results.to_csv("product_classification_results.csv", index=False)
print(df_results[["product_name", "predicted_category", "gold_category", "match", "cosine_similarity"]])

                            product_name  \
0                 Dyson V11 Torque Drive   
1            Keurig K-Supreme Plus SMART   
2                      iPhone 15 Pro Max   
3             YETI Rambler 20 oz Tumbler   
4                        Sony WH-1000XM5   
5  Blue Diamond Almonds - Lightly Salted   
6                          Peloton Bike+   
7              Nest Thermostat (3rd Gen)   
8         L'Oréal Paris Revitalift Serum   
9    Kindle Paperwhite Signature Edition   

                     predicted_category       gold_category  match  \
0                       vacuum cleaners      vacuum cleaner  False   
1               small kitchen appliance        coffee maker  False   
2                            smartphone          smartphone   True   
3                  drinkware / tumblers           drinkware  False   
4      headphones / wireless headphones          headphones  False   
5                            snack food               snack  False   
6                     fit