# The Challenge of Evaluating LLM-based Applications
```{epigraph}
Evals are surprisingly often all you need.

-- Greg Brockman, OpenAI's President
```
```{contents}
:depth: 2
```
## Non-Deterministic Machines

One of the most fundamental challenges when building products with Large Language Models (LLMs) is their non-deterministic nature. Unlike traditional software systems where the same input reliably produces the same output, LLMs can generate different responses each time they're queried - even with identical prompts and input data. This characteristic is both a strength and a significant engineering challenge.

When you ask ChatGPT or any other LLM the same question multiple times, you'll likely get different responses. This isn't a bug - it's a fundamental feature of how these models work. The "temperature" parameter, which controls the randomness of outputs, allows models to be creative and generate diverse responses. However, this same feature makes it incredibly difficult to build reliable, testable systems.

Consider a financial services company using LLMs to generate investment advice. The non-deterministic nature of these models means that:
- The same market data could yield different analysis conclusions
- Testing becomes far more complex than it is for traditional software
- Regulatory compliance becomes challenging to guarantee
- User trust may be affected by inconsistent responses

### Temperature and Sampling

The primary source of non-determinism in LLMs comes from their sampling strategies. During text generation, the model:
1. Calculates probability distributions for each next token
2. Samples from these distributions based on temperature settings
3. Uses techniques like nucleus sampling to balance creativity and coherence
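
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than how any particular model implements sampling: the function names (`softmax_with_temperature`, `nucleus_filter`, `sample_token`), the four-word vocabulary, and the logit values are all made up for the example, and treating temperature 0 as greedy decoding is an assumption of this sketch.

```python
import math
import random
from typing import List, Optional


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; temperature flattens or sharpens them."""
    if temperature <= 0:
        # Assumption of this sketch: temperature 0 means greedy decoding,
        # i.e. all probability mass goes to the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize the surviving tokens so they sum to 1 again
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample_token(vocab: List[str], logits: List[float], temperature: float,
                 top_p: float = 1.0, rng: Optional[random.Random] = None) -> str:
    """Sample one token after applying temperature scaling and nucleus filtering."""
    rng = rng or random.Random()
    probs = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical next-token candidates and scores, for illustration only
vocab = ["growth", "decline", "stable", "volatile"]
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.0, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Raising the temperature spreads probability mass over more tokens, which is why repeated calls diverge more at higher settings.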

### The Temperature Spectrum

- Temperature = 0: Most deterministic, but potentially repetitive
- Temperature = 1: Balanced creativity and coherence
- Temperature > 1: Increased randomness, potentially incoherent

```python
import os
from typing import List

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()


def generate_responses(
    model_name: str,
    prompt: str,
    temperatures: List[float],
    attempts: int = 3
) -> pd.DataFrame:
    """
    Generate multiple responses at different temperature settings
    to demonstrate non-deterministic behavior.
    """
    client = OpenAI()
    results = []

    # Query the model several times at each temperature setting
    for temp in temperatures:
        for attempt in range(attempts):
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=temp,
                max_tokens=50
            )

            results.append({
                'temperature': temp,
                'attempt': attempt + 1,
                'response': response.choices[0].message.content
            })

    # Display results grouped by temperature
    df_results = pd.DataFrame(results)
    for temp in temperatures:
        print(f"\nTemperature = {temp}")
        print("-" * 40)
        temp_responses = df_results[df_results['temperature'] == temp]
        for _, row in temp_responses.iterrows():
            print(f"Attempt {row['attempt']}: {row['response']}")

    return df_results
```

```python
MAX_LENGTH = 10000  # We limit the input length to avoid token issues
with open('../data/apple.txt', 'r') as file:
    sec_filing = file.read()
sec_filing = sec_filing[:MAX_LENGTH]
df_results = generate_responses(
    model_name="gpt-3.5-turbo",
    prompt=f"Write a single-statement executive summary of the following text: {sec_filing}",
    temperatures=[0.0, 1.0, 2.0]
)
```


```
Temperature = 0.0
----------------------------------------
Attempt 1: Apple Inc. filed its Form 10-K for the fiscal year ended September 28, 2024 with the SEC, detailing its business operations and financial performance.
Attempt 2: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, detailing its business operations, products, and financial information.
Attempt 3: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, detailing its business operations, products, and financial information.

Temperature = 1.0
----------------------------------------
Attempt 1: Apple Inc., a well-known seasoned issuer based in California, designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories, with a focus on innovation and technology.
Attempt 2: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, reporting on its business operations, products, an
```

This simple experiment reveals a fundamental challenge in LLM evaluation: even a simple parameter like temperature can dramatically alter model behavior in ways that are difficult to systematically assess. At temperature 0.0, responses are consistent but potentially too rigid. At 1.0, outputs become more varied but less predictable. At 2.0, responses can be wildly different and often incoherent. This non-deterministic behavior makes traditional software testing approaches inadequate.

The implications for evaluation are profound. How can one effectively test an LLM-powered system when the same prompt can yield radically different outputs based on a single parameter? Traditional testing relies on predictable inputs and outputs, but LLMs force us to grapple with probabilistic behavior. While lower temperatures may seem safer for critical applications, they don't eliminate the underlying uncertainty - they merely mask it. This highlights the need for new evaluation paradigms that can handle both deterministic and probabilistic aspects of LLM behavior.
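
One modest step toward such a paradigm is to measure how much repeated responses to the same prompt agree with each other, instead of asserting on a single expected string. The sketch below uses character-level similarity from Python's standard `difflib` module as a rough proxy. The helper name `pairwise_consistency` is illustrative, and surface similarity is only an assumption-laden stand-in for semantic agreement; the sample strings are the three temperature 0.0 attempts from the experiment above.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def pairwise_consistency(responses: List[str]) -> float:
    """Mean similarity ratio (0 to 1) over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response is trivially consistent
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


# The three attempts recorded at temperature 0.0 in the experiment above
temp_0_responses = [
    "Apple Inc. filed its Form 10-K for the fiscal year ended September 28, 2024 "
    "with the SEC, detailing its business operations and financial performance.",
    "Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended "
    "September 28, 2024, detailing its business operations, products, and "
    "financial information.",
    "Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended "
    "September 28, 2024, detailing its business operations, products, and "
    "financial information.",
]

score = pairwise_consistency(temp_0_responses)
print(f"Consistency at temperature 0.0: {score:.2f}")
```

A test suite could then assert that consistency stays above a chosen threshold for a given prompt and temperature, which tolerates small wording changes while still flagging large drifts.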



## Emerging Properties

Beyond their non-deterministic nature, LLMs present another fascinating challenge: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren't explicitly programmed but rather emerge "naturally" as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against clear specifications.

```{figure} ../_static/evals/emerging.png
---
name: emerging-properties
alt: Emerging Properties
class: bg-primary mb-1
scale: 60%
align: center
---
Emergent abilities of large language models and the scale at which they appear {cite}`wei2022emergentabilitieslargelanguage`.
```

{numref}`emerging-properties` lists emergent abilities of large language models and the scale at which each appears. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.

The implications for evaluation are profound. While conventional software testing relies on stable test suites and well-defined acceptance criteria, LLM evaluation must contend with a constantly shifting landscape of capabilities. What worked to evaluate a 7B parameter model may be completely inadequate for a 70B parameter model that has developed new emergent abilities. This dynamic nature of LLM capabilities forces us to fundamentally rethink our approach to testing and evaluation.

## Problem Statement

Consider a practical example that illustrates these challenges: building a customer support chatbot powered by an LLM. In traditional software development, you would define specific features (like handling refund requests or tracking orders) and write tests to verify each function. But with LLMs, you're not just testing predefined features - you're trying to evaluate emergent capabilities like understanding context, maintaining conversation coherence, and generating appropriate emotional responses.

This fundamental difference raises critical questions about evaluation:
- How do we measure capabilities that weren't explicitly programmed?
- How can we ensure consistent performance when abilities may suddenly emerge or evolve?
- What metrics can capture both the technical accuracy and the subjective quality of responses?

The challenge becomes even more complex when we consider that traditional software evaluation methods simply weren't designed for these kinds of systems. We need new frameworks that can account for both the deterministic aspects we're used to testing and the emergent properties that make LLMs unique. Let's explore how LLM evaluation differs from traditional software testing across several key dimensions:
- **Capability Assessment vs Functional Testing**: Traditional software testing validates specific functionality against predefined requirements. LLM evaluation must assess capabilities that are not necessarily predefined: "emergent properties" like reasoning, creativity, and language understanding that extend beyond explicit programming.

- **Metrics and Measurement Challenges**: While traditional software metrics can usually be precisely defined and measured, LLM evaluation often involves subjective qualities like "helpfulness" or "naturalness" that resist straightforward quantification. Even when we try to break these down into numeric scores, the underlying judgment remains inherently human and context-dependent.

- **Dataset Contamination**: Traditional software testing uses carefully crafted test cases with known inputs and expected outputs (e.g., unit tests, integration tests). In contrast, LLMs trained on massive internet-scale datasets risk having already seen and memorized evaluation examples during training, which can lead to artificially inflated performance scores. Guarding against this requires careful dataset curation to ensure test sets are truly unseen by the model, along with rigorous cross-validation approaches.

- **Benchmark Evolution**: Traditional software maintains stable test suites over time. LLM benchmarks continuously evolve as capabilities advance, making longitudinal performance comparisons difficult and potentially obsoleting older evaluation methods.

- **Human Evaluation Requirements**: Traditional software testing automates most validation. LLM evaluation may demand significant human oversight to assess output quality, appropriateness, and potential biases through structured annotation and systematic review processes.

```{table} Evals of Traditional Software vs LLMs
:name: evals-table
| Aspect                                      | Traditional Software                             | LLMs                                                                                     |
|---------------------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------|
| **Capability Assessment**          | Validates specific functionality against requirements | May assess emergent properties like reasoning and creativity                                      |
| **Metrics and Measurement**                             | Precisely defined and measurable metrics                     | Subjective qualities that resist straightforward quantification                                                      |
| **Dataset Contamination**                             | Uses carefully crafted test cases                   | Risk of memorized evaluation examples from training                                                          |
| **Benchmark Evolution**                              | Maintains stable test suites                                 | Continuously evolving benchmarks as capabilities advance                                                 |
| **Human Evaluation**                        | Mostly automated validation                                     | May require significant human oversight                                                        |
```
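
To make the dataset contamination point concrete, one common heuristic is to check how many word n-grams of a test example also appear verbatim in text the model may have been trained on. The sketch below is a minimal version of that idea, not a method prescribed by this chapter: the function names and the default n-gram size are illustrative choices, and it assumes you have access to a sample of the training corpus, which is often not the case for proprietary models.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_example: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the test example's n-grams that also occur in the corpus text."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0  # example is shorter than n words, nothing to compare
    return len(test_grams & ngrams(corpus_text, n)) / len(test_grams)


# Toy data, for illustration only
corpus_sample = "the quick brown fox jumps over the lazy dog"
seen_example = "quick brown fox jumps"
unseen_example = "a completely different sentence here"

print(overlap_ratio(seen_example, corpus_sample, n=3))
print(overlap_ratio(unseen_example, corpus_sample, n=3))
```

Examples with a high overlap ratio are candidates for removal from the test set or at least for separate reporting.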

## Solutions

### Evals Design

First, it's important to make a distinction between evaluating an LLM versus evaluating an LLM-based application (our focus). While the former offers foundation capabilities and is typically general-purpose, the latter is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLMs with their associated prompts and parameters to solve a particular business problem.

That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications are evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for the design of evaluation systems:

1. Application requirements are closely tied to LLM evaluations
2. The same LLM can yield different results in different applications
3. Evaluation must align with business objectives
4. A great LLM doesn't guarantee a great application!


#### Conceptual Overview

When evaluating an LLM-based application, we need to consider the following components:

Examples, Application, Evaluator, Score


The key components in the diagrams below, with their inputs, outputs, and purposes, are:

1. Examples/Dataset (Input Source):
- Purpose: Provides standardized test cases for evaluation
- Input: Collection of test cases
- Output: Test inputs fed to multiple LLM applications
- Optional Connection to Evaluator: Can provide reference/ground truth for comparison

2. LLM Applications (Processing Layer):
- Input: Test cases from Examples
- Processing: Each LLM (LLM_1, LLM_2, ... LLM_N) processes the same inputs
- Output: Generated responses/results
- Purpose: 
  * Represents different LLM implementations/vendors
  * Could be different models (GPT-4, Claude, PaLM, etc.)
  * Could be different configurations of same model
  * Could be different prompting strategies

3. Evaluator (Assessment Layer):
- Input: 
  * LLM outputs from all applications
  * Optional reference data from Examples
- Processing: Applies evaluation metrics and scoring criteria
- Output: Individual scores for each LLM application
- Purpose:
  * Measures performance across defined metrics
  * Ensures consistent evaluation across all LLMs
  * Applies standardized scoring criteria

4. Scores (Metric Layer):
- Input: Evaluation results from Evaluator
- Output: Quantified performance metrics
- Purpose:
  * Represents performance in numerical form
  * Enables quantitative comparison
  * May include multiple metrics per LLM

5. Leaderboard (Ranking Layer):
- Input: Scores from all LLM applications
- Processing: Aggregates and ranks performances
- Output: Ordered ranking of LLMs with scores
- Purpose:
  * Provides clear comparison view
  * Shows relative performance
  * Helps in decision-making

The flow demonstrates a systematic approach where:
1. Same inputs are provided to all LLMs
2. Responses are evaluated consistently
3. Performance is quantified objectively
4. Results are ranked for easy comparison

Key aspects of the design:
- Scalability: Can handle many LLMs (shown by ...)
- Fairness: Same inputs and evaluation criteria for all
- Transparency: Clear flow from input to final ranking
- Modularity: Components can be updated independently
- Standardization: Consistent evaluation process
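
The flow described above (examples, applications, evaluator, scores, leaderboard) can be sketched as a small harness. Everything here is a hypothetical stand-in: the `Example` dataclass, the `run_evaluation` function, the exact-match evaluator, and the two stub applications, which return canned answers so the sketch runs without calling any real LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Example:
    prompt: str
    reference: Optional[str] = None  # optional ground truth for the evaluator


# An application maps a prompt to an output; an evaluator maps
# (output, reference) to a numeric score.
App = Callable[[str], str]
Evaluator = Callable[[str, Optional[str]], float]


def exact_match(output: str, reference: Optional[str]) -> float:
    """Score 1.0 when the output equals the reference (case-insensitive), else 0.0."""
    if reference is None:
        return 0.0
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_evaluation(examples: List[Example], apps: Dict[str, App],
                   evaluator: Evaluator) -> List[Tuple[str, float]]:
    """Feed the same examples to every app, score them, and rank by mean score."""
    if not examples:
        raise ValueError("At least one example is required")
    scores = {}
    for name, app in apps.items():
        per_example = [evaluator(app(ex.prompt), ex.reference) for ex in examples]
        scores[name] = sum(per_example) / len(per_example)
    # Leaderboard: highest mean score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


examples = [Example("2+2", "4"), Example("capital of France", "Paris")]

# Stub applications standing in for real LLM-backed systems
apps = {
    "app_a": lambda q: {"2+2": "4", "capital of France": "Paris"}.get(q, ""),
    "app_b": lambda q: {"2+2": "4"}.get(q, "unknown"),
}

leaderboard = run_evaluation(examples, apps, exact_match)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.2f}")
```

Because applications and evaluators are plain callables, either can be swapped out independently, which reflects the modularity aspect listed above.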




```{figure} ../_static/evals/conceptual.png
---
name: conceptual
alt: Conceptual Overview
scale: 40%
align: center
---
Conceptual overview of LLM-based application evaluation.
```

{numref}`conceptual` shows the conceptual overview of LLM-based application evaluation.


```{figure} ../_static/evals/conceptual-multi.svg
---
name: conceptual-multi
alt: Conceptual Overview
scale: 50%
align: center
---
Conceptual overview of the evaluation of multiple LLM-based applications.
```

{numref}`conceptual-multi` shows the same overview for the evaluation of multiple LLM-based applications.


#### Considerations

The following are key conceptual aspects and questions to consider when planning an LLM application evaluation system:

1. Examples/Dataset Design:
- What types of examples should be included in the test set?
  * Does it cover all important use cases?
  * Are edge cases represented?
  * Is there a good balance of simple and complex examples?
- How do we ensure data quality?
  * Are the examples representative of real-world scenarios?
  * Is there any bias in the test set?
- Should we have separate test sets for different aspects (accuracy, safety, etc.)?
- Do we need human-validated ground truth for all examples?

2. LLM Applications:
- What aspects of each LLM app should be standardized for fair comparison?
  * Prompt templates
  * Context length
  * Temperature and other parameters
  * Rate limiting and timeout handling
- How to handle different LLM capabilities and limitations?
  * Some models might have special features others don't
  * Cost and latency differences
  * Different output formats
- Should we test different configurations of the same LLM?

3. Evaluator Design:
- What metrics should we measure?
  * Accuracy/correctness
  * Response relevance
  * Output consistency
  * Response latency
  * Cost efficiency
  * Safety and bias metrics
- How do we define success for different types of tasks?
  * Objective metrics vs subjective assessment
  * Task-specific evaluation criteria
  * Handling partial correctness
- Should evaluation be automated or involve human review?
  * Balance between automation and human judgment
  * Inter-rater reliability for human evaluation
  * Cost and scalability considerations

4. Scoring System:
- How should different metrics be weighted?
  * Relative importance of different factors
  * Task-specific prioritization
  * Business requirements alignment
- Should scores be normalized or absolute?
- How to handle missing capabilities or failed responses?
- Should we consider confidence scores from the LLMs?

5. Leaderboard/Ranking:
- How often should rankings be updated?
- Should ranking include confidence intervals?
- How to handle ties or very close scores?
- Should we maintain separate rankings for different:
  * Task types
  * Cost tiers
  * Performance characteristics

6. Overall System Design:
- How to ensure evaluation system scalability?
- How to maintain test set security?
- How to handle API changes and versioning?
- How to validate the evaluation system itself?
- How to make the system extensible for new:
  * Metrics
  * LLM providers
  * Use cases
  * Evaluation methods

7. Practical Considerations:
- Budget constraints for running evaluations
- API rate limits and quotas
- Maintenance and monitoring requirements
- Documentation and reproducibility
- Legal and compliance requirements
- Disaster recovery and backup plans

8. Business Integration:
- How will evaluation results inform business decisions?
- What is the update frequency needed?
- How to handle vendor selection and migration?
- What level of transparency is needed in reporting?
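
Several of the scoring questions above (weighting, normalization, and failed responses) become easier to reason about with a concrete example. The sketch below is one possible design, not a recommendation: the `normalize` and `weighted_score` helpers, the metric names, the weights, and the choice to score a missing metric as 0 are all assumptions made for illustration.

```python
from typing import Dict, Optional


def normalize(value: float, worst: float, best: float) -> float:
    """Map a raw metric onto [0, 1], where `best` maps to 1 and `worst` to 0.

    Works for metrics where lower is better (e.g. latency) by passing
    a `best` that is smaller than `worst`.
    """
    if best == worst:
        raise ValueError("best and worst must differ")
    scaled = (value - worst) / (best - worst)
    return min(1.0, max(0.0, scaled))  # clamp out-of-range values


def weighted_score(metrics: Dict[str, Optional[float]],
                   weights: Dict[str, float]) -> float:
    """Weighted average of normalized metrics; a missing or None metric counts as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    weighted = sum(w * (metrics.get(name) or 0.0) for name, w in weights.items())
    return weighted / total_weight


# Hypothetical results for one application
metrics = {
    "accuracy": 0.9,
    "latency": normalize(2.0, worst=10.0, best=0.0),  # 2 seconds on a 0-10 s scale
    "safety": None,  # the safety check failed to produce a result
}
weights = {"accuracy": 0.5, "latency": 0.2, "safety": 0.3}

print(f"Overall score: {weighted_score(metrics, weights):.2f}")
```

Scoring a failed metric as 0 penalizes unreliable applications; an alternative design would exclude the metric and renormalize the remaining weights, which answers the same question differently.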

This evaluation framework allows organizations to:
1. Systematically compare different LLM solutions
2. Make data-driven decisions about LLM selection
3. Monitor performance over time
4. Identify areas for improvement
5. Manage costs and risks effectively

The key is to design an evaluation system that is:
- Comprehensive yet practical
- Fair and unbiased
- Scalable and maintainable
- Aligned with business objectives
- Adaptable to evolving requirements
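
On the question of whether rankings should include confidence intervals, a percentile bootstrap over per-example scores is one simple way to obtain them. The sketch below assumes per-example scores are already available as a list of numbers; the function name `bootstrap_ci`, the resample count, and the sample scores are illustrative choices.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 confidence: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample with replacement and record the mean of each resample
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical per-example scores (1.0 = correct, 0.0 = incorrect)
per_example_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]

low, high = bootstrap_ci(per_example_scores)
print(f"Mean: {mean(per_example_scores):.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

When the intervals of two applications overlap heavily, their relative order on the leaderboard should be treated as a tie until more examples are evaluated.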


### Approaches

#### Human-Based Evaluation

#### Metrics-Based Evaluation

A metrics-based approach enables automated benchmarking for evaluating LLM performance on specific tasks and capabilities. It provides a quantifiable and repeatable way to measure progress and identify areas for improvement. This is particularly useful for well-defined tasks, such as spam classification, data extraction or translation, where clear and objective evaluation criteria can be established. 

The core approach involves using pre-existing datasets (golden datasets) and establishing objective metrics to evaluate model performance. The process typically involves the following steps:

1.  **Selecting a relevant benchmark dataset:** The choice of dataset depends on the specific task or capability being evaluated.  For example, the HumanEval dataset is used to evaluate code generation capabilities, while ChartQA focuses on chart understanding.
2.  **Providing input to the LLM:**  The LLM is given input from the selected dataset, prompting it to perform the specific task, such as answering questions, generating text, or translating languages.
3.  **Comparing outputs to expected answers:** The LLM's outputs are compared to the expected or correct answers provided in the benchmark dataset.
4.  **Quantifying the comparison using metrics:** The comparison is quantified using pre-defined metrics relevant to the task, providing a numerical score that reflects the LLM's performance.  For instance, accuracy, precision, and recall are common metrics for classification tasks.

This approach enables efficient and automated evaluation, allowing for large-scale comparisons and tracking of progress over time.
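
The four steps above can be sketched for the spam classification case. The golden dataset here is a handful of made-up examples, and `keyword_classifier` is a deliberately naive stub standing in for the LLM call so the sketch runs offline; in a real evaluation that function would send the text to the application under test.

```python
from typing import Callable, Dict, List, Tuple


def classification_metrics(predictions: List[str], labels: List[str],
                           positive: str = "spam") -> Dict[str, float]:
    """Compute accuracy, precision, and recall for a binary classification task."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def evaluate(classifier: Callable[[str], str],
             golden: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run the classifier on every golden input and compare against expected labels."""
    predictions = [classifier(text) for text, _ in golden]
    labels = [label for _, label in golden]
    return classification_metrics(predictions, labels)


def keyword_classifier(text: str) -> str:
    """Naive stub standing in for an LLM-based spam classifier."""
    return "spam" if "free" in text.lower() else "ham"


# Step 1: a tiny, made-up golden dataset of (input, expected label) pairs
golden_dataset = [
    ("Win a free prize now", "spam"),
    ("Meeting at 10am tomorrow", "ham"),
    ("Free money, click here", "spam"),
    ("Lunch on Friday?", "ham"),
    ("Claim your reward today", "spam"),
]

# Steps 2 to 4: generate outputs, compare to expected answers, quantify
results = evaluate(keyword_classifier, golden_dataset)
for metric, value in results.items():
    print(f"{metric}: {value:.2f}")
```

Reporting precision and recall alongside accuracy matters here because a classifier can score well on accuracy while still missing a meaningful share of the positive class.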

## References
```{bibliography}
:filter: docname in docnames
```