# Workshop Module: LLM Testing Strategies for Business Research

## LLM Testing
- Correctness & Accuracy
  * **Unit tests**: Check exact match against expected answers.
  * **Classification metrics**: Compare predicted labels to ground truth (e.g. accuracy, F1).
  * **Factuality checks**: Validate claims against external sources; flag hallucinations.

- Consistency & Stability
  * **Regression testing**: Detect output changes over time or between model versions.
  * **Prompt sensitivity**: Assess how small wording or formatting changes affect results.
  * **Reproducibility**: Ensure the same prompt yields the same output under controlled conditions.

- Structure & Formatting
  * **Output structure enforcement**: Validate that responses follow required formats (e.g., bullet points, tables, JSON).
  * **Length control**: Ensure responses respect word limits.
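
The structure and length checks listed above can be sketched with the standard library alone. This is a minimal illustration, not a full validator: the function names `check_json_structure` and `check_word_limit`, the required keys, and the sample string are assumptions made up for this example.

```python
import json
from typing import List

def check_json_structure(output: str, required_keys: List[str]) -> bool:
    """Return True if output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)

def check_word_limit(output: str, max_words: int) -> bool:
    """Return True if output has at most max_words whitespace-separated words."""
    return len(output.split()) <= max_words

# Hypothetical LLM response used only to exercise the checks
sample = '{"headline": "Travel more, pay less", "audience": "young professionals"}'
print(check_json_structure(sample, ["headline", "audience"]))  # True
print(check_json_structure("not json at all", ["headline"]))   # False
print(check_word_limit("Travel more, pay less", max_words=5))  # True (4 words)
```

Counting whitespace-separated tokens is a rough proxy for word count; a stricter limit (for example on characters or model tokens) would need a different counter.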




### Case 1: Extract information from an earnings report
* Defines a "golden set" of expected phrases (like unit test assertions).
* Checks that the LLM includes all critical facts.
* Gives a clean pass/fail result with explanation.


In [27]:

from openai import OpenAI
import os
from pathlib import Path
from typing import List

def llm_openai(prompt: str, llm_model: str) -> str:
    """Call the OpenAI Chat Completions API and return the output text."""

    # Read the OpenAI API key from a local file (./openai.key)
    api_key = Path("./openai.key").read_text().strip()
    client = OpenAI(api_key=api_key)

    completion = client.chat.completions.create(
        model=llm_model,
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0,
    )
    return completion.choices[0].message.content


In [None]:
import datetime
import json
import os

def llm_with_logging(prompt, llm_model, llm_func):
    """
    Execute an LLM prompt with the provided function and log the interaction to a JSONL file.
    Args:
        prompt (str): The prompt to send to the LLM.
        llm_model (str): The name of the LLM model to use.
        llm_func (callable): The function that performs the LLM call; it takes (prompt, llm_model) and returns the output text.
    Returns:
        str: The output from the LLM.
    """
    log_dir = "logs"
    # prompts_dir = "prompts"
    
    os.makedirs(log_dir, exist_ok=True)
    # os.makedirs(prompts_dir, exist_ok=True)
    
    now = datetime.datetime.now()
    date_str = now.date().isoformat()
    timestamp_str = now.strftime("%Y%m%d-%H%M%S")
    
    output_file = os.path.join(log_dir, f"logfile_{date_str}.jsonl")
    # prompt_file = os.path.join(prompts_dir, f"prompt_{timestamp_str}.txt")

    # with open(prompt_file, "w") as f:
    #     f.write(prompt)
    # print(f"Saved prompt to {prompt_file}")

    output = llm_func(prompt, llm_model)

    log_entry = {
        "timestamp": now.isoformat(),
        "model": llm_model,
        "prompt": prompt,
        "output": output,
    }

    with open(output_file, "a") as f:
        f.write(json.dumps(log_entry) + "\n")

    print(f"Logged interaction to {output_file}")
    return output
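
Because each interaction is appended as one JSON object per line, the log can be read back later to compare outputs across models or dates. The sketch below assumes the same four fields written above (`timestamp`, `model`, `prompt`, `output`); the helper names `load_log_entries` and `outputs_by_model` and the demo entries are hypothetical and are not real model output.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

def load_log_entries(log_path: str) -> List[Dict]:
    """Read a JSONL log file and return one dict per non-empty line."""
    entries = []
    for line in Path(log_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            entries.append(json.loads(line))
    return entries

def outputs_by_model(entries: List[Dict], prompt: str) -> Dict[str, List[str]]:
    """Group logged outputs by model name for one exact prompt string."""
    grouped: Dict[str, List[str]] = {}
    for entry in entries:
        if entry["prompt"] == prompt:
            grouped.setdefault(entry["model"], []).append(entry["output"])
    return grouped

# Demo on a throwaway file with two made-up entries
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = Path(tmp_dir) / "logfile_demo.jsonl"
    demo_entries = [
        {"timestamp": "t1", "model": "model-a", "prompt": "p", "output": "first"},
        {"timestamp": "t2", "model": "model-b", "prompt": "p", "output": "second"},
    ]
    demo_log.write_text(
        "\n".join(json.dumps(e) for e in demo_entries) + "\n", encoding="utf-8"
    )
    grouped = outputs_by_model(load_log_entries(str(demo_log)), "p")

print(grouped)  # {'model-a': ['first'], 'model-b': ['second']}
```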


In [29]:
def test_summary_quality(output: str, expected_phrases: List[str]) -> bool:
    """Unit-style test: check if all expected phrases appear in the output."""
    output = output.lower()
    expected_phrases = [phrase.lower() for phrase in expected_phrases]
    missing = [phrase for phrase in expected_phrases if phrase not in output]
    if missing:
        print("=== Test failed. Missing key points:")
        for m in missing:
            print(f" - {m}")
        return False
    print("=== Test passed. Output contains all expected key points.")
    return True
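
Exact substring matching is strict: a paraphrase of an expected phrase fails the check even when the fact is present. One possible relaxation, sketched here under the assumption that acceptable alternative phrasings can be listed by hand, is to treat each key point as a group of alternatives and pass if any one appears. The name `find_missing_groups` and the sample headline are made up for illustration.

```python
from typing import List

def find_missing_groups(output: str, phrase_groups: List[List[str]]) -> List[List[str]]:
    """Return the groups for which none of the alternative phrasings appears in output."""
    text = output.lower()
    return [
        group for group in phrase_groups
        if not any(alternative.lower() in text for alternative in group)
    ]

# Hypothetical headline and alternatives, used only to exercise the check
headline = "No Fee, Big Rewards: Travel Perks & Cash Back"
groups = [["travel"], ["no annual fee", "no fee"], ["cash back", "cashback"]]

missing = find_missing_groups(headline, groups)
print(missing)  # [] because every group has at least one alternative present
```

The trade-off is maintenance: every accepted paraphrase has to be added by hand, and a loose alternative can let a wrong answer pass.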

In [30]:

# --- Realistic business prompt for LLM ---
test_prompt = """
Summarize the following Q1 2024 earnings report for internal strategy briefings. Focus on revenue, profit, growth drivers, and geographic performance:

'In its first quarter of fiscal 2024, Starbucks reported consolidated net revenues of $9.4 billion, representing an 8% increase over the prior year. Net income grew to $1.1 billion, or $0.78 per share, compared to $0.68 per share in Q1 2023. The North America segment saw a 9% revenue increase, fueled by higher average ticket and increased store traffic. International revenue rose 7%, with particularly strong performance in China where same-store sales jumped 11%. Starbucks opened 549 new stores globally in the quarter, bringing its total to over 38,000. CEO Laxman Narasimhan cited continued investment in digital platforms and mobile ordering as a key competitive advantage. Operating margin expanded 40 basis points year-over-year. The company reaffirmed its full-year guidance of high-single-digit revenue growth.'
"""

# --- Define expected content for unit test ---
expected_phrases = [
    "$9.4 billion",
    "$1.1 billion",
    "North America",
    "China",
    # "549 new stores",
    # "digital platforms and mobile ordering",
    # "operating margin"
]

# --- Run the test ---
llm_model = "gpt-4.1-nano"
output = llm_with_logging(test_prompt, llm_model, llm_openai)
print("\n📤 LLM Output:\n", output)
print("\n🔍 Running Unit Test...\n")
test_summary_quality(output, expected_phrases)

Saved prompt to prompts/prompt_20250618-144651.txt
Logged interaction to logs/logfile_2025-06-18.jsonl

📤 LLM Output:
 Q1 2024 Earnings Summary for Internal Strategy Briefing:

**Revenue and Profit:**
- Consolidated net revenues reached $9.4 billion, up 8% year-over-year.
- Net income increased to $1.1 billion, with EPS of $0.78, compared to $0.68 in Q1 2023.
- Operating margin expanded by 40 basis points, indicating improved profitability.

**Growth Drivers:**
- Strong performance driven by higher average tickets and increased store traffic, particularly in North America.
- Continued investment in digital platforms and mobile ordering enhanced customer engagement and sales.
- Opening of 549 new stores globally, expanding the total footprint to over 38,000 locations.

**Geographic Performance:**
- North America: Revenue grew 9%, supported by increased customer visits and higher spend per visit.
- International: Revenue rose 7%, with China leading international growth—same-store sales u

True

### Case 2: Generate marketing materials

* Track changes in model output over time or when prompts/models change.

In [31]:
test_prompt = """Write a compelling headline for an email marketing campaign promoting a new rewards credit card for young professionals. Focus on travel perks, no annual fee, and cash back."""

expected_phrases = ["travel", "no annual fee", "cash back"]

# Run on two model versions
llm_model = "gpt-4.1-nano"
output = llm_with_logging(test_prompt, llm_model, llm_openai)
print("\n📤 LLM Output:\n", output)
test_summary_quality(output, expected_phrases)

llm_model = "gpt-3.5-turbo"
output = llm_with_logging(test_prompt, llm_model, llm_openai)
print("\n📤 LLM Output:\n", output)
test_summary_quality(output, expected_phrases)


Saved prompt to prompts/prompt_20250618-144653.txt
Logged interaction to logs/logfile_2025-06-18.jsonl

📤 LLM Output:
 Unlock Your Adventure: No Fee, Big Rewards — Travel Perks & Cash Back for Young Professionals!
=== Test failed. Missing key points:
 - no annual fee
Saved prompt to prompts/prompt_20250618-144653.txt
Logged interaction to logs/logfile_2025-06-18.jsonl

📤 LLM Output:
 "Unlock Your Passport to Adventure with Our New Rewards Credit Card - No Annual Fee, Cash Back, and Travel Perks Await!"
=== Test passed. Output contains all expected key points.


True
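
To track change between two outputs directly, rather than only re-running the phrase check, a text-similarity score can serve as a rough drift measure. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function names, the `max_drift` threshold of 0.5, and the two sample strings are assumptions chosen for illustration, not values from this workshop.

```python
from difflib import SequenceMatcher

def output_drift(baseline: str, candidate: str) -> float:
    """Return 1 minus the character-level similarity ratio (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, baseline, candidate).ratio()

def regression_check(baseline: str, candidate: str, max_drift: float = 0.5) -> bool:
    """Return True if the candidate output stays within the allowed drift."""
    return output_drift(baseline, candidate) <= max_drift

# Made-up strings standing in for outputs from two runs
baseline = "No Annual Fee, Cash Back, and Travel Perks Await!"
candidate = "Travel Perks, Cash Back, and No Annual Fee Await!"

print(output_drift(baseline, baseline))  # 0.0
print(f"drift: {output_drift(baseline, candidate):.2f}")
print("within limit:", regression_check(baseline, candidate))
```

Character-level drift is sensitive to word order and punctuation, so it flags rewordings that keep the meaning; an embedding-based comparison is less sensitive to surface changes.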

### Case 3: Check classification outcome

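* Use embeddings to measure semantic similarity between predicted and ground-truth categories, since exact string matching rejects near-synonyms.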

In [32]:

from pathlib import Path

import numpy as np
from openai import OpenAI

def get_embedding_openai(text):
    """
    Get the embedding vector for a text string using the OpenAI embeddings API.
    """
    api_key = Path("./openai.key").read_text().strip()
    client = OpenAI(api_key=api_key)

    # Use a small model for speed/cost; adjust as needed
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )

    return response.data[0].embedding

def cosine_similarity(vec1, vec2):
    """
    Compute cosine similarity between two vectors.
    """
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
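
A worked example on toy vectors shows how the score behaves: vectors pointing the same way score 1, orthogonal vectors score 0. The version below is a dependency-free equivalent of the formula written only for this illustration; the name `cosine_similarity_plain` and the zero-vector check are additions, not part of the function above.

```python
import math
from typing import Sequence

def cosine_similarity_plain(vec1: Sequence[float], vec2: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

print(cosine_similarity_plain([1.0, 0.0], [2.0, 0.0]))            # 1.0 (same direction)
print(cosine_similarity_plain([1.0, 0.0], [0.0, 3.0]))            # 0.0 (orthogonal)
print(round(cosine_similarity_plain([1.0, 1.0], [1.0, 0.0]), 4))  # 0.7071 (45 degrees apart)
```

Vector length does not affect the score, only direction does, which is why the first pair scores 1.0 despite different magnitudes.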


In [33]:
import pandas as pd
import re
import numpy as np

# Evaluation function
def evaluate_product_classification(df, llm_model, prompt):
    """
    Send each product name to the LLM and compare the output category to the golden label.
    Returns a DataFrame with the LLM response, predicted category, match status, error type,
    and cosine similarity between the predicted and golden categories.
    """
    results = []

    for _, row in df.iterrows():
        product = row["product_name"]
        gold_category = row["ground_truth_category"]

        user_prompt = f"{prompt} {product}\n"

        llm_output = llm_with_logging(user_prompt, llm_model, llm_openai)

        # Try to extract category from LLM output
        category_match = re.search(r"Category:\s*(.*)", llm_output)
        predicted_category = category_match.group(1).strip().lower() if category_match else "N/A"

        match = predicted_category == gold_category.lower()

        # Get embeddings and similarity
        pred_emb = get_embedding_openai(predicted_category)
        gold_emb = get_embedding_openai(gold_category)
        similarity = cosine_similarity(pred_emb, gold_emb)

        results.append({
            "product_name": product,
            "gold_category": gold_category,
            "llm_response": llm_output,
            "predicted_category": predicted_category,
            "match": match,
            "error_type": "" if match else "misclassification",
            "cosine_similarity": similarity
        })

    return pd.DataFrame(results)
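
The per-row `match` and `cosine_similarity` values can be rolled up into summary metrics. The sketch below computes exact-match accuracy and a "soft" accuracy that also counts rows whose similarity clears a threshold. It works on a plain list of dicts (the shape `DataFrame.to_dict("records")` produces); the function names, the 0.8 threshold, and the toy rows are assumptions for illustration and are not results from this notebook.

```python
from typing import Dict, List

def exact_accuracy(records: List[Dict]) -> float:
    """Share of rows where the predicted category equals the golden label."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["match"]) / len(records)

def soft_accuracy(records: List[Dict], threshold: float) -> float:
    """Share of rows that match exactly or whose similarity is at least threshold."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if r["match"] or r["cosine_similarity"] >= threshold
    )
    return hits / len(records)

# Toy rows with made-up values, only to show the arithmetic
toy_records = [
    {"match": True, "cosine_similarity": 1.0},
    {"match": False, "cosine_similarity": 0.9},
    {"match": False, "cosine_similarity": 0.4},
    {"match": False, "cosine_similarity": 0.85},
]

print(exact_accuracy(toy_records))                # 0.25
print(soft_accuracy(toy_records, threshold=0.8))  # 0.75
```

The threshold is a judgment call: it should be tuned by inspecting borderline pairs, since a value that is too low accepts a broader or different category as a match.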



In [34]:
df_products = pd.DataFrame({
    "product_name": [
        "Dyson V11 Torque Drive",
        "Keurig K-Supreme Plus SMART",
        "iPhone 15 Pro Max",
        "YETI Rambler 20 oz Tumbler",
        "Sony WH-1000XM5",
        "Blue Diamond Almonds - Lightly Salted",
        "Peloton Bike+",
        "Nest Thermostat (3rd Gen)",
        "L'Oréal Paris Revitalift Serum",
        "Kindle Paperwhite Signature Edition"
    ],
    "ground_truth_category": [
        "vacuum cleaner",
        "coffee maker",
        "smartphone",
        "drinkware",
        "headphones",
        "snack",
        "exercise equipment",
        "smart home device",
        "skincare",
        "e-reader"
    ]
})

test_prompt = """Classify the following product into a specific consumer product category. 
Reply in the format: 
Category: <your category>
Reason: <your reasoning based on the product name>
Product: """

llm_model = "gpt-4.1-nano"
df_results = evaluate_product_classification(df_products, llm_model, test_prompt)

Saved prompt to prompts/prompt_20250618-144654.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144655.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144657.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144658.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144659.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144700.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144702.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144703.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144706.txt
Logged interaction to logs/logfile_2025-06-18.jsonl
Saved prompt to prompts/prompt_20250618-144707.txt
Logged interaction to 

In [35]:
df_results.to_csv("product_classification_results.csv", index=False)
print(df_results[["product_name", "predicted_category", "gold_category", "match","cosine_similarity"]])

                            product_name  \
0                 Dyson V11 Torque Drive   
1            Keurig K-Supreme Plus SMART   
2                      iPhone 15 Pro Max   
3             YETI Rambler 20 oz Tumbler   
4                        Sony WH-1000XM5   
5  Blue Diamond Almonds - Lightly Salted   
6                          Peloton Bike+   
7              Nest Thermostat (3rd Gen)   
8         L'Oréal Paris Revitalift Serum   
9    Kindle Paperwhite Signature Edition   

                     predicted_category       gold_category  match  \
0                       vacuum cleaners      vacuum cleaner  False   
1              small kitchen appliances        coffee maker  False   
2                            smartphone          smartphone   True   
3                  drinkware / tumblers           drinkware  False   
4          headphones / audio equipment          headphones  False   
5                      food & beverages               snack  False   
6                     fit