# Workshop: Evaluating LLM based applications

It is so easy and quick to build a shiny PoC using LLMs and it is so hard to turn it into a production-grade LLM application. To succeed you need a robust evaluation framework, which you are going to use during the development and post-deployment of your LLM based app.

This workshop consists of 4 main parts:
* evaluation-driven development and architecture of a LLM based app
* evaluation framework for a LLM based app
* test suite with evals for a LLM based app 
* monitoring foundations for a LLM based app

**About workshop giver**

With over 19 years of experience in Data and AI, [Una Galyeva](https://www.linkedin.com/in/unagalyeva/) held various positions, from hands-on Data and AI development to leading Data and AI teams and departments. As a driving force behind [PyLadies Amsterdam](https://amsterdam.pyladies.com/), a Microsoft MVP, Women in AI Benelux Advisory board member, and the owner of AI MLOps Agency, Una is passionate about challenging perspectives and inspiring others to see things differently.

## Evaluation-driven development and architecture of a LLM based app


### Evaluation-driven development

**Evaluation-driven development** is a methodology to guide the development of LLM based apps via a set of task-specific evaluations. This term is inspired by test-driven development in software engineering. 

![Evaluation Driven Development Workflow](../assets/EDD.png)

*Source: [Evaluation-driven development workflow](https://docs.databricks.com/en/generative-ai/tutorials/ai-cookbook/evaluation-driven-development.html) by Databricks*

### Architecture

A LLM based app architecture can be quite complex. Usually you start small and step by step extend it by adding new components.
* basic LLM app: a user query is passed directly to a LLM model, its response is passed back to the user
* enhanced context: LLM model is given access to external data and tools for creating more informed responses
* guardrails in place to protect both the LLM app and its users
* model router and gateaway to support complex pipelines and enhance security
* performance optimization to reduce latency and costs with caching
* agent patterns to incorporate complex logic and actions to maximize the LLM app capabilities

![Architecure of a LLM based app](../assets/LLM_app_arch.png)

*Source: Chapter 10 of [AI Engineering](https://www.oreilly.com/library/view/ai-engineering/9781098166298/) by Chip Huyen*


## Evaluation framework for a LLM based app

**Why it is important? How to convince stakeholders?**

An evaluation system for your LLM based app is a crucial investment, similar to quality assurance and safety testing for regular software. It ensures that investments in your app (such as cloud costs and manpower) yield desired results like cost reduction or revenue generation. In addition, it helps to retain users and avoid negative consequences like PR disasters, guarantees safety and mitigates risks. Furthermore, evaluations are vital for compliance with emerging AI regulations and for maintaining team efficiency by enabling faster iterations, smoother updates, and effective debugging. Even at the PoC stage, evaluations are essential to demonstrate real value beyond a simple demo.

**What to evaluate?**

Real-world LLM based apps are complex. They often involve numerous interconnected components and multi-turn interactions. Evaluation should consider multiple levels: quality of individual turns, task completion and also intermediate outputs. It's important to assess both the final outcome of the entire app and the performance of each component independently.

**How to create an evaluation framework?**

When creating the evaluation framework, it's crucial to define what your app shouldn't do next to a desired app functionality. For instance, when developing a customer support chatbot, it's important to define what is an off-topic request, such as inquiries about writing some Python code. This involves determining how to identify out-of-scope inputs and establishing appropriate responses to them.
1. Define evaluation criteria (defining what "good" means)
2. Create scoring rubrics with examples (choose a scoring system, develop a rubric with examples, validate it with human feedback)
3. Tie evaluation metrics to business metrics (consider them in the context of business problem it's built to solve)
4. Select evaluation methods 
5. Annotate evaluation data
6. Evaluate your evaluation process
7. Iterate

Some of the evaluation methods require ground truth and some of them not. You can choose among traditional predictive metrics (accuracy, precision, recall, etc.), generative metrics (BLEU, ROUGE, etc.), ML-based (semantic similarity, sentiment, toxicity, etc.), LLM-as-a-judge for nuanced evaluations, and simpler methods like regular expressions and text statistics.

Data can be sourced from manually created test cases, existing user data, beta-user feedback, public benchmarks (e.g safety), and synthetic data. The latter can be generated by LLMs (plausible inputs, input-output pairs useful for RAG systems, diversified inputs). 

It is a good idea to organize your eval datasets in three groups:
1. happy path scenarios (typical user queries)
2. edge cases (plausible but challenging inputs)
3. adversarial scenarios (unsafe or malicious inputs) 

Building and maintaining these datasets is an ongoing process crucial for measuring improvement and avoiding guesswork in LLM based app development. 

**Which Python OSS libraries can help me with LLM based app evaluation and monitoring?**

Take a look at MLflow, AdalFlow, Evidently, Opik, DeepEval, Ragas, LangChain and more.


### Use case for this workshop

An internal company LLM based app - Q&A system created to answer employees' questions about HR, finance, etc.

Q&A app capabilities:
* answers must be correct
* answers must be helpful, complete and relevant
* answer tone must be professional and match the company brand style
* format less than 100 words, must include the link to the source

Q&A app should mitigate the following risks:
* hallucinations
* bias
* toxicity
* off-topic use
* data leakage 

### Setup and dataset prep

For this workshop you will use [Evidently](https://docs.evidentlyai.com/) - an open-source Python library for ML and LLM evaluation and observability. It helps evaluate, test, and monitor AI-powered systems and data pipelines from experimentation to production. Main features are:
* works with tabular, text data, and embeddings.
* supports predictive and generative systems, from classification to RAG
* 100+ built-in metrics from data drift detection to LLM judges
* Python interface for custom metrics and tests
* both offline evals and live monitoring
* open architecture: easily export data and integrate with existing tools

Evidently is very modular. You will start with one-off evaluations using Reports (compute various data, ML and LLM quality metrics) and continue with Test Suites (check for defined conditions on metric values and return a pass or fail result) in Python. You can export the evaluation results to a data frame, Python dictionary, JSON and HTML.

In [1]:
import pandas as pd
import requests
from evidently import ColumnMapping
from evidently.descriptors import *
from evidently.metric_preset import TextEvals
from evidently.metrics import *
from evidently.report import Report
from evidently.test_suite import TestSuite
from evidently.tests import *
from io import BytesIO

Import the QA dataset using requests library, convert into pandas data frame, parse dates and set *start_time* as an index.

In [4]:
response = requests.get("https://raw.githubusercontent.com/pyladiesams/eval-llm-based-apps-jan2025/main/assets/QA.csv")
qa_csv_content = BytesIO(response.content)
qa_logs = pd.read_csv(qa_csv_content, index_col=0, parse_dates=['start_time', 'end_time'])
qa_logs.index = qa_logs.start_time
qa_logs.index.rename('index', inplace=True)

Familiarize yourself with the dataset by getting a preview of its first three rows.

In [None]:
pd.set_option('display.max_colwidth', None)
qa_logs.head(3)

While working with Evidently it is highly recommended to map the data schema to make sure that it is parsed correctly. To handle this, create a column mapping by identifying the type of columns and pointing to a *datetime* column for adding a time index to your plots.

In [6]:
column_mapping = ColumnMapping(
    datetime='start_time',
    datetime_features=['end_time'],
    text_features=['question', 'response'],
    categorical_features=['organization', 'model_ID', 'region', 'environment', 'feedback'],
)


Let's dive into applying different evaluation methods by leveraging Evidently built-in **descriptors** (each evaluation that computes a score for every text in the dataset). We will start with the simplest and gradually progress towards more complex ones.

### Text statistics

Computes descriptive text statistics by evaluating simple properties like text length, sentence count, word count, percentage of out-of-vocabulary words, percentage of non-letter characters.

**Evaluate text length**

Generate a Report to evaluate the length of each text in the *response* column. Run this check for the first 200 rows of the *qa_logs* dataframe.

This calculates the number of symbols in each text and shows a summary. You can see the distribution of the text length across all responses and descriptive statistics like the mean or minimal text length. 

Click on Details to see how the mean text length changes over time. The index comes from the *datetime* column you mapped earlier. This helps you to notice any temporal patterns, such as if texts are longer or shorter during specific periods.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

**Get a side-by-side comparison**. 

You can also generate statistics for two datasets at once. For example, compare the outputs of two different prompts or data from today against yesterday.

Pass one dataset as a reference one and another as a current one. For simplicity, let's compare the text length for the first and next 100 responses from the same dataframe:

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=qa_logs[:100],
                      current_data=qa_logs[100:200],
                      column_mapping=column_mapping)
text_evals_report

**Exercise 1**

Get a side by side comparison of the sentence count for the first and the next 100 rows from the same dataframe. Use SentenceCount() descriptor.
Consult the descriptor docs [here](https://docs.evidentlyai.com/reference/all-metrics#descriptors-text-stats)

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  ,
                  ]
              )
])

text_evals_report.run(reference_data=qa_logs[:100],
                      current_data=qa_logs[100:200],
                      column_mapping=column_mapping)
text_evals_report

### Text patterns

Detect specific words or regular patterns by using regular expressions behind the scenes. Such evals are faster and cheaper to compute at scale. 

For example, check if the responses mention competitors, banned or forbidden words, include emails or other links. Text pattern descriptors return a binary score ("True" or "False") for pattern matches.

Let's check if *responses* contains specific words related to the compensation (such as salary, benefits, or payroll). Pass this word list to the IncludesWords() descriptor. This will also check for word variants. Add an optional display name for this eval.

You can see that 18 responses out of 200 relate to the topic of compensation as defined by this word list. Details show occurrences in time.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  IncludesWords(
                      words_list=['salary', 'benefits', 'payroll'],
                      display_name="Mention Compensation")
            ]
        ),
        ]
)

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

**Exercise 2**

Check whether the first 200 responses contain links or not. Add "Contains links" display name for this eval. Find an appropriate descriptor for your task [here](https://docs.evidentlyai.com/reference/all-metrics#descriptors-text-patterns)

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
            ]
        )
        ]
)

text_evals_report.run(reference_data=,
                      current_data=,
                      column_mapping=column_mapping)
text_evals_report

### ML-based evaluation

Uses pre-trained machine learning models for evaluation. 

Evidently has built-in model-based descriptors and wrappers to call external models published on Hugging Face.

**Semantic similarity**

You can evaluate how closely two texts are in meaning using an embedding model. SemanticSimilarity() descriptor calculates pairwise semantic similarity between columns for each pair of text. You can compare the text from *Response* and *Question* columns to see if the answers are semantically relevant to the question.

SemanticSimilarity() descriptor converts all texts into embeddings, measures Cosine Similarity between them, and returns a score from 0 to 1:
* 0 means that texts are opposite in meaning;
* 0.5 means that texts are unrelated;
* 1 means that texts are semantically close.

In this case, the semantic similarity always stays above 0.81, which means that answers generally relate to the question.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        SemanticSimilarity(with_column="question", 
                           ),
    ])
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

**Sentiment analysis**

Analyzes the sentiment of the text using a word-based model. Returns score on a scale: -1 (negative) to 1 (positive). Shows the distribution of the response sentiment. Allows you to spot specific times when the average sentiment of the responses dipped.

**Exercise 3**

Execute a sentiment check on the first 200 responses


In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            ,
        ]
    ),
])

text_evals_report.run(reference_data=,
                      current_data=,
                      column_mapping=column_mapping)
text_evals_report

**Toxicity**

The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model. In this model, 'hate' is defined as abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation. This model returns a predicted toxicity score between 0 and 1. First, the descriptor downloads the model from Hugging Face to your environment. Then uses it to score the data. It takes a few moments to load the model. The higher the score the more toxic is your response.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            HuggingFaceToxicityModel(),
        ]
    ),
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

### LLM-as-a-judge

For more complex or nuanced checks, you can use LLM-as-a-judge. This requires asking the same or more powerful LLM to assess the text by specific criteria, such as tone or conciseness. Please keep in mind that it is not a precise metric but rather an approximation of human judgment, requiring clear instructions and potentially manual labeling to establish evaluation criteria. This method is well suited for pairwise comparison, direct scoring with reference. 

**How to create LLM-as-a-judge?**
1. Create a test dataset
2. Add your own labels and comments (helps to clarify your own criteria)
3. Write an eval prompt (use binary or low-precision scores, split complex criteria, ask for reasoning)
4. Evaluate the LLM-as-a-judge (to iterate get back to the step 3)
5. Deploy the evaluator

**Limitations of LLM-as-a-judge**
* inconsitency
* criteria ambiguity
* increased costs and latency
* biases of LLM-as-a-judge

Evidently has built-in descriptors and prompt templates to create your own custom LLM-as-a-judge (it requires an OpenAI API key). More information could be found in Further resources section at the end of this notebook.


### Metadata summary

The QA dataset has a *feedback* column which includes user upvotes and downvotes. You can easily enrich your Report with summaries from any numerical or categorical columns.

Use ColumnSummaryMetric() to add a summary of the *feedback* column.

In [None]:
feedback_report = Report(metrics=[
   ColumnSummaryMetric(column_name="feedback"),
   ]
)

feedback_report.run(reference_data=None, 
                    current_data=qa_logs[:200], 
                    column_mapping=column_mapping)
feedback_report

## Test suite with evals for a LLM based app

Building extensive test suites with evals for a LLM based app takes time. The best way it to start small and extend them gradually with more automated tests.

**Where do you need Test suites with evals?**
* during development for *regression testing*: run test suites whenever you modify any part of your LLM based app, such as trying a new retrieval strategy, model version, or prompt. The goal is to check that updates don't make the quality worse or introduce new errors. You compare new responses with references or against set of criteria.
* in production for *continuous testing*: run test suites periodically over production logs to check that the output quality stays within expectations.

**What kind of tests your Test suite can have?**

Below you can find some test types, this list is not exhaustive.

*Example-based tests*

By breaking down the scope of your LLM based app into features and scenarios, you can write assertions for different scenarios. Such tests are usually fast and cheap to run. 

*Behaviour-driven tests*

By writing feature scenarios and expectations in natural language, namely gherkin format (given, when, then), you will be able to align better on the requirements with non-technical stakeholders. For this you can use *behave* (Python OSS library for behavior-driven development).

*Adversarial tests*

Tests to detect adversarial prompting and more.

It's enough theory for now, let's try it out.

So far, you've used Evidently Reports to summarize evaluation outcomes. Now you need to set specific conditions for the metric values to run automated tests (such as check if all texts fall within the expected length range) and review results. This is where you can use an alternative interface called **TestSuites**. TestSuites work similarly to Reports, but instead of listing metrics, you define tests and set conditions using parameters like gt (greater than), lt (less than), eq (equal), etc. 

**Define a Test Suite**

Add tests to check the following conditions:
* response length is always non-zero
* maximum response length does not exceed 1800 symbols (e.g., due to UI constraints).
* mean response length is above 500 symbols (e.g., this is a known pattern).

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name = TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name = TextLength().on("response"), lte=1800),
    TestColumnValueMean(column_name = TextLength().on("response"), gt=500),
])

test_suite.run(reference_data=None, 
                    current_data=qa_logs[:200], 
                    column_mapping=column_mapping)
test_suite

**Exercise 4**

Enrich the current test suite with a new condition: average response sentiment is positive.

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name = TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name = TextLength().on("response"), lte=1800),
    TestColumnValueMean(column_name = TextLength().on("response"), gt=500),
    ,
])

test_suite.run(reference_data=None, 
                    current_data=qa_logs[:200], 
                    column_mapping=column_mapping)
test_suite

**Custom Test Suite**

You can start by re-using available tests presets, later you can design a custom Test Suite by picking up specific Tests and setting conditions more precisely. 
1. Choose individual Tests: select the tests you want to include in your Test Suite.
2. Pass Test parameters: set custom parameters for applicable Tests
3. Set custom conditions: define when Tests should pass or fail.
4. Mark Test criticality: mark non-critical Tests to give a Warning instead of Fail. 

More extended information can be found [here](https://docs.evidentlyai.com/user-guide/tests-and-reports/run-tests#custom-test-suite)

## Monitoring foundations for a LLM based app

**Observability** provides a comprehensive view of your LLM based app performance beyond predefined metrics. The more complex your LLM based app is, the more critical it becomes to have a proper observability. 

**How monitoring is related to evaluation?**

As a part of your observability system, both *monitoring* and *evaluation* share a common objective: to minimize potential risks (such as app failures, security attacks, and drifts) and identify opportunities for enhancing app performance and reducing costs. Evaluation metrics should be translated into monitoring metrics. Issues detected during monitoring should be incorporated back into your evaluation process. 

**Metrics as the backbone of your monitoring system**

Before deciding which metrics to track, identify the potential failures you want to detect and then design metrics specifically to catch those failures. Format errors are the simplest failures to start with. Length-related metrics will help you to track latency and costs, as longer contexts and responses typically increase latency and incur higher costs. Latency metrics can be time to first token, time per output token, total response latency. Cost-related metrics can be number of queries, tokens per second. For safety, you can track toxicity and detect sensitive information in both inputs and outputs, the frequency of guardrail triggers and refusals to answer. Also, keep an eye on unusual queries, as these can highlight edge cases or potential attacks. Keep in mind that each component in your LLM based app will come with its own set of specific metrics. When calculating metrics, ensure they can be segmented by key dimensions like users, prompt or chain versions and types, and time. This detailed breakdown will help you to understand performance variations and identify specific issues.

**Your typical production debbuging workflow**
1. Your metrics signal an issue, but don't explain its cause. 
2. To understand the root cause you examine logs from the time the metrics indicated the issue. 
3. You correlate the errors found in those logs to the metrics to confirm the correct issue has been identified.

**Why metrics and logs are not enough for LLM based apps?**

Having *metrics* and *logs* is not enough for LLM based apps if you want to trace each request step-by-step through the system. This requires *tracing*, the detailed recording of a request’s execution path through various system components and services. It reveals the entire process from when a user sends a query to when the final response is returned to the user, including the actions the system takes, the documents retrieved, and the final prompt sent to the model. Additionaly, if measurable, it can show how much time each step took and its associated cost.


Feel free to play around with tracing in Evidently. It uses *tracely* - Python OSS library based on OpenTelemetry to track events in your LLM based apps.

## Concluding remarks

Main take-aways:
* evaluation-driven development steers development of LLM based apps to the right direction
* without proper evaluation framework your LLM based app is doomed
* automated test suits help you to bring your evaluation framework to life
* right combination of metrics, logs and tracing not only helps you sleep during the night, but also increase compliance of your AI system with the EU AI Act  

**Further resources**

* [What We’ve Learned From A Year of Building with LLMs](https://applied-llms.org/)
* [Patterns for Building LLM-based Systems & Products](https://eugeneyan.com/writing/llm-patterns/#evals-to-measure-performance)
* [Tutorial: LLM Regression testing](https://docs.evidentlyai.com/tutorials-and-examples/cookbook_llm_regression_testing)
* [Levels of Complexity: RAG Applications](https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/)
* [Creating a LLM-as-a-Judge That Drives Business Results](https://hamel.dev/blog/posts/llm-judge/)
* [Tutorial: LLM-as-a-judge](https://docs.evidentlyai.com/tutorials-and-examples/cookbook_llm_judge)
* [Adversarial Prompting in LLMs](https://www.promptingguide.ai/risks/adversarial)
* [Monitoring overview](https://docs.evidentlyai.com/user-guide/monitoring/monitoring_overview)
* [Tutorial: Tracing](https://docs.evidentlyai.com/tutorials-and-examples/tutorial_tracing)


**Used materials**

1. [LLM Evaluation course by Evidently](https://youtube.com/playlist?list=PL9omX6impEuMgDFCK_NleIB0sMzKs2boI&feature=shared)
2. [LLM Evaluation tutorial by Evidently](https://docs.evidentlyai.com/tutorials-and-examples/tutorial-llm)
3. [AI Engineering book by Chip Huyen](https://learning.oreilly.com/library/view/ai-engineering/9781098166298/)
4. [Your AI Product Needs Evals by Hamel Husain](https://hamel.dev/blog/posts/evals/)
5. [Evaluation-driven development workflow by Databricks](https://docs.databricks.com/en/generative-ai/tutorials/ai-cookbook/evaluation-driven-development.html)
6. [LLM-workshop by MLOps and Crafts](https://github.com/mlops-and-crafts/llm-workshop/)
