# Workshop: Evaluating LLM-based applications

It is easy and quick to build a shiny PoC with LLMs, but hard to turn it into a production-grade LLM application. To succeed, you need a robust evaluation framework that you use both during development and after deployment of your LLM-based app.

This workshop consists of four main parts:
* evaluation-driven development and architecture of an LLM-based app
* evaluation framework for an LLM-based app
* test suite with evals for an LLM-based app
* monitoring foundations for an LLM-based app

**About the workshop giver**

With over 19 years of experience in Data and AI, [Una Galyeva](https://www.linkedin.com/in/unagalyeva/) held various positions, from hands-on Data and AI development to leading Data and AI teams and departments. As a driving force behind [PyLadies Amsterdam](https://amsterdam.pyladies.com/), a Microsoft MVP, Women in AI Benelux Advisory board member, and the owner of AI MLOps Agency, Una is passionate about challenging perspectives and inspiring others to see things differently.

## Evaluation-driven development and architecture of an LLM-based app


### Evaluation-driven development

**Evaluation-driven development** is a methodology that guides the development of LLM-based apps via a set of task-specific evaluations. The term is inspired by test-driven development in software engineering.

![Evaluation Driven Development Workflow](../assets/EDD.png)

*Source: [Evaluation-driven development workflow](https://docs.databricks.com/en/generative-ai/tutorials/ai-cookbook/evaluation-driven-development.html) by Databricks*
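
To make the analogy with test-driven development concrete, here is a minimal sketch in plain Python. It is not part of the Databricks workflow: the eval functions, the `run_evals` helper, and the 0.95 threshold are hypothetical names and values chosen for illustration. The idea is that every change to the app is followed by a run of task-specific evals, and the pass rates decide whether the change is acceptable.

```python
from typing import Callable, Dict, List

# Hypothetical task-specific evals: each returns True (pass) or False (fail).
def is_non_empty(response: str) -> bool:
    return len(response.strip()) > 0

def is_short_enough(response: str, max_chars: int = 1800) -> bool:
    return len(response) <= max_chars

EVALS: Dict[str, Callable[[str], bool]] = {
    "non_empty": is_non_empty,
    "short_enough": is_short_enough,
}

def run_evals(responses: List[str]) -> Dict[str, float]:
    """Return the pass rate of every eval over a batch of responses."""
    if not responses:
        raise ValueError("need at least one response to evaluate")
    return {
        name: sum(check(r) for r in responses) / len(responses)
        for name, check in EVALS.items()
    }

def ready_to_ship(pass_rates: Dict[str, float], threshold: float = 0.95) -> bool:
    """Accept a change only if every eval reaches the (assumed) threshold."""
    return all(rate >= threshold for rate in pass_rates.values())

sample_responses = [
    "You can request leave in the HR portal.",
    "",  # an empty response should fail the non_empty eval
    "Expenses are reimbursed monthly.",
]
pass_rates = run_evals(sample_responses)
print(pass_rates)
print("ready to ship:", ready_to_ship(pass_rates))
```

In practice the evals would be richer than these two checks, but the loop stays the same: change the app, rerun the evals, compare the scores.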

### Architecture

An LLM-based app architecture can become quite complex. It usually starts small and is expanded step by step by adding new components:
* basic LLM app: a user query is passed directly to an LLM, and its response is passed back to the user
* enhanced context: the LLM is given access to external data and tools for creating more informed responses
* guardrails in place to protect both the LLM app and its users
* model router and gateway to support complex pipelines and enhance security
* performance optimization to reduce latency and costs with caching
* agent patterns to incorporate complex logic and actions to maximize the LLM app capabilities

![Architecture of an LLM-based app](../assets/LLM_app_arch.png)

*Source: Chapter 10 of [AI Engineering](https://www.oreilly.com/library/view/ai-engineering/9781098166298/) by Chip Huyen*
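
The first few layers of such an architecture can be sketched in plain Python. This is only an illustration under strong assumptions: `fake_llm` stands in for a real model call, the guardrail is a simple banned-word check, the context lookup is a naive keyword match, and the cache is an in-memory dictionary. The `LLMApp` class and all its names are hypothetical; model routing and agent patterns are left out.

```python
from typing import Callable, Dict

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return f"Answer to: {prompt}"

BANNED_WORDS = {"password"}  # hypothetical input guardrail

class LLMApp:
    def __init__(self, model: Callable[[str], str], knowledge: Dict[str, str]):
        self.model = model
        self.knowledge = knowledge          # external data, keyed by topic
        self.cache: Dict[str, str] = {}     # query -> response
        self.model_calls = 0

    def answer(self, query: str) -> str:
        # Guardrail: refuse queries that contain banned words.
        if any(word in query.lower() for word in BANNED_WORDS):
            return "Sorry, I cannot help with that request."
        # Cache: reuse the response for a query seen before.
        if query in self.cache:
            return self.cache[query]
        # Enhanced context: add every knowledge entry whose topic occurs in the query.
        context = " ".join(
            text for topic, text in self.knowledge.items() if topic in query.lower()
        )
        prompt = f"Context: {context}\nQuestion: {query}" if context else query
        self.model_calls += 1
        response = self.model(prompt)
        self.cache[query] = response
        return response

app = LLMApp(fake_llm, {"leave": "Employees get 25 days of leave."})
first = app.answer("How much leave do I get?")
second = app.answer("How much leave do I get?")   # served from the cache
blocked = app.answer("What is the admin password?")
print(first)
print("model calls:", app.model_calls)
print(blocked)
```

Each layer is a separate step in `answer`, which also makes each of them a natural place to attach an evaluation later on.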


## Evaluation framework for an LLM-based app

TODO base descr

### Setup and dataset prep

TODO add info about our usecase, steps and brief descr of Evidently lib 

Each evaluation that computes a score for every text in the dataset is called a **descriptor**. Descriptors can be numerical or categorical.
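
Conceptually, a descriptor is a function applied to every text that yields one value per row. The sketch below shows this idea in plain Python, without Evidently; the two functions and the sample texts are hypothetical and only illustrate the difference between a numerical and a categorical descriptor.

```python
from typing import Dict, List, Union

def text_length(text: str) -> int:
    """Numerical descriptor: number of characters in the text."""
    return len(text)

def contains_question(text: str) -> str:
    """Categorical descriptor: "True" if the text contains a question mark."""
    return "True" if "?" in text else "False"

texts = [
    "How do I claim expenses?",
    "Submit the form in the finance portal.",
]

# One row of descriptor values per input text.
rows: List[Dict[str, Union[str, int]]] = [
    {
        "text": t,
        "text_length": text_length(t),
        "contains_question": contains_question(t),
    }
    for t in texts
]

for row in rows:
    print(row)
```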

In [1]:
import numpy as np
import pandas as pd
import requests
from datetime import datetime, timedelta
from evidently import ColumnMapping
from evidently.descriptors import *
from evidently.metric_preset import TextEvals
from evidently.metrics import *
from evidently.report import Report
from evidently.test_suite import TestSuite
from evidently.tests import *
from io import BytesIO

For this workshop you will work with a question-answering dataset that imitates an internal company Q&A system (created to answer employees' questions about HR, finance, etc.).

Download it with the requests library, load it into a pandas DataFrame, parse the dates, and set 'start_time' as the index.

In [2]:
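# Download the CSV file and fail early if the request was unsuccessful
response = requests.get("https://raw.githubusercontent.com/pyladiesams/eval-llm-based-apps-jan2025/main/assets/QA.csv")
response.raise_for_status()
qa_csv_content = BytesIO(response.content)
# Load the data and parse both timestamp columns as datetimes
qa_logs = pd.read_csv(qa_csv_content, index_col=0, parse_dates=['start_time', 'end_time'])
# Use 'start_time' as the index (it also stays available as a column)
qa_logs.index = qa_logs.start_time
qa_logs.index.rename('index', inplace=True)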

Get a preview of the first three rows of the dataset to familiarize yourself with it a bit.

In [None]:
pd.set_option('display.max_colwidth', None)
qa_logs.head(3)

While working with Evidently it is highly recommended to map the data schema to make sure that it is parsed correctly.

To handle this, create a column mapping by identifying the type of columns and pointing to a "datetime" column for adding a time index to your plots.

In [4]:
column_mapping = ColumnMapping(
    datetime='start_time',
    datetime_features=['end_time'],
    text_features=['question', 'response'],
    categorical_features=['organization', 'model_ID', 'region', 'environment', 'feedback'],
)

Besides viewing the visual Reports in Python, you can export the evaluation results. Evidently currently supports export to a pandas DataFrame, a Python dictionary, JSON, and HTML.


TODO add proper transition paragraph

If your LLM solves a classification or retrieval task, you can evaluate classification or ranking quality. Consult the available Presets, Metrics, and Tests for other checks you can run.

TODO check links to the presets, metrics and tests.

### Text statistics

Text statistics descriptors compute descriptive statistics by evaluating simple properties of the text, such as text length, sentence count, word count, percentage of out-of-vocabulary words, and percentage of non-letter characters.

**Evaluate text length**

Generate a Report to evaluate the length of each text in the *response* column. Run this check for the first 200 rows of the *qa_logs* dataframe.

This calculates the number of symbols in each text and shows a summary. You can see the distribution of the text length across all responses and descriptive statistics like the mean or minimum text length.

Click on Details to see how the mean text length changes over time. The index comes from the *datetime* column you mapped earlier. This helps you to notice any temporal patterns, such as if texts are longer or shorter during specific periods.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

**Get a side-by-side comparison**

You can also generate statistics for two datasets at once. For example, compare the outputs of two different prompts or data from today against yesterday.

Pass one dataset as the reference and the other as the current one. For simplicity, let's compare the text length for the first and the next 100 rows of the same dataframe:

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=qa_logs[:100],
                      current_data=qa_logs[100:200],
                      column_mapping=column_mapping)
text_evals_report
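
What a side-by-side comparison boils down to can be sketched in plain Python: compute the same summary statistics for both datasets and put them next to each other. This is an illustration only and not how Evidently renders its Reports; the `length_summary` helper and the sample texts are hypothetical.

```python
from statistics import mean
from typing import Dict, List

def length_summary(texts: List[str]) -> Dict[str, float]:
    """Summarize text length (in characters) for a list of texts."""
    lengths = [len(t) for t in texts]
    return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}

# Toy stand-ins for the reference and current datasets.
reference = ["Short answer.", "A somewhat longer answer."]
current = ["Tiny.", "An answer that is quite a bit longer than before."]

ref_summary = length_summary(reference)
cur_summary = length_summary(current)

for stat in ("min", "mean", "max"):
    print(f"{stat:>4}: reference={ref_summary[stat]:.1f}  current={cur_summary[stat]:.1f}")
```

The same pattern applies when the two datasets come from two prompt versions or from two different days.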

**Exercise 1**

Get a side-by-side comparison of the sentence count for the first and the next 100 rows of the same dataframe. Use the SentenceCount() descriptor.
Consult the descriptor docs [here](https://docs.evidentlyai.com/reference/all-metrics#descriptors-text-stats).

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  ,
                  ]
              )
])

text_evals_report.run(reference_data=qa_logs[:100],
                      current_data=qa_logs[100:200],
                      column_mapping=column_mapping)
text_evals_report

### Text patterns

Text pattern descriptors detect specific words or regular patterns, using regular expressions behind the scenes. Such evals are faster and cheaper to compute at scale than model-based ones.

For example, you can check whether the responses mention competitors or banned words, or include email addresses or links. Text pattern descriptors return a binary score ("True" or "False") for pattern matches.

Let's check if the *response* column contains specific words related to compensation (such as salary, benefits, or payroll). Pass this word list to the IncludesWords() descriptor. This will also check for word variants. Add an optional display name for this eval.

You can see that 18 responses out of 200 relate to the topic of compensation as defined by this word list. Details show occurrences in time.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  IncludesWords(
                      words_list=['salary', 'benefits', 'payroll'],
                      display_name="Mention Compensation"),
                  ]
              )
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report
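
The sketch below shows what a word-list check does in principle, using Python's `re` module directly. It is a simplification and not Evidently's implementation: it matches whole words case-insensitively and, unlike IncludesWords(), does not handle word variants (so "salaries" would not match "salary"). The `includes_words` function and the sample responses are hypothetical.

```python
import re
from typing import List

def includes_words(text: str, words: List[str]) -> bool:
    """Return True if any of the words occurs as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

compensation_words = ["salary", "benefits", "payroll"]
responses = [
    "Your salary is paid on the 25th of each month.",
    "The office is closed on public holidays.",
    "Payroll questions go to the finance team.",
]

flags = [includes_words(r, compensation_words) for r in responses]
print(flags)
print(sum(flags), "of", len(flags), "responses mention compensation")
```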

**Exercise 2**

Get a side-by-side comparison of the excluded words (competitor, offer) for the first and the next 100 rows of the same dataframe. Add the display name "No competitor offer" for this eval. Use the ExcludesWords() descriptor.
Consult the descriptor docs [here](https://docs.evidentlyai.com/reference/all-metrics#descriptors-text-patterns).

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
            ]
        ),
        ]
)

text_evals_report.run(reference_data=qa_logs[:100],
                      current_data=qa_logs[100:200],
                      column_mapping=column_mapping)
text_evals_report

### ML-based evaluation

Uses pre-trained machine learning models for evaluation. 

Evidently has built-in model-based descriptors and wrappers to call external models published on Hugging Face.

**Semantic similarity**

You can evaluate how close two texts are in meaning using an embedding model. The SemanticSimilarity() descriptor calculates pairwise semantic similarity between two columns for each pair of texts. You can compare the texts from the *response* and *question* columns to see if the answers are semantically relevant to the questions.

The SemanticSimilarity() descriptor converts all texts into embeddings, measures the cosine similarity between them, and returns a score from 0 to 1:
* 0 means that the texts are opposite in meaning;
* 0.5 means that the texts are unrelated;
* 1 means that the texts are semantically close.

In this case, the semantic similarity always stays above 0.81, which means that answers generally relate to the question.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        SemanticSimilarity(with_column="question"),
    ])
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report
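
To see why 0, 0.5, and 1 read as "opposite", "unrelated", and "close", recall that raw cosine similarity ranges from -1 to 1. The sketch below assumes the score is obtained by linearly rescaling cosine similarity to the 0 to 1 range; this rescaling is an assumption that matches the interpretation above, not a statement about Evidently's internals. The two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Rescale cosine similarity from [-1, 1] to [0, 1] (assumed rescaling)."""
    return (cosine_similarity(a, b) + 1) / 2

question = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same meaning, different magnitude
orthogonal = [0.0, 3.0]       # unrelated
opposite = [-1.0, 0.0]        # opposite meaning

print(normalized_similarity(question, same_direction))
print(normalized_similarity(question, orthogonal))
print(normalized_similarity(question, opposite))
```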

**Sentiment analysis**

The sentiment check analyzes the sentiment of each text using a word-based model and returns a score on a scale from -1 (negative) to 1 (positive). The Report shows the distribution of the response sentiment and allows you to spot specific times when the average sentiment of the responses dipped.
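
The idea behind a word-based sentiment score can be illustrated with a toy lexicon. The sketch below is not the model Evidently uses: the word lists and the scoring formula are hypothetical and far too small for real use. It only shows how counting positive and negative words can yield a score between -1 and 1.

```python
import re

# Hypothetical mini-lexicons; a real word-based model uses thousands of weighted entries.
POSITIVE = {"great", "happy", "thanks", "helpful"}
NEGATIVE = {"sorry", "unfortunately", "problem", "unable"}

def toy_sentiment(text: str) -> float:
    """Return (positive - negative) / (positive + negative), or 0.0 without matches."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

for text in [
    "Great, happy to help!",
    "Unfortunately we are unable to do that.",
    "The office opens at nine.",
]:
    print(f"{toy_sentiment(text):+.2f}  {text}")
```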

**Exercise 3**

Run a sentiment check on the first 200 responses.


In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            ,
        ]
    ),
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

**Toxicity**

The toxicity check quantifies the toxicity of the input texts using a pretrained hate speech classification model. In this model, 'hate' is defined as abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation. The model returns a predicted toxicity score between 0 and 1: the higher the score, the more toxic the response. The descriptor first downloads the model from Hugging Face to your environment and then uses it to score the data, so it takes a few moments to load.

In [None]:
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            HuggingFaceToxicityModel(),
        ]
    ),
])

text_evals_report.run(reference_data=None,
                      current_data=qa_logs[:200],
                      column_mapping=column_mapping)
text_evals_report

### LLM-as-a-judge

For more complex or nuanced checks, you can use an LLM as a judge. This requires creating an evaluation prompt that asks the same or a more powerful LLM to assess the text against specific criteria, such as tone or conciseness.
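
A minimal sketch of this pattern is shown below. Everything in it is hypothetical: the prompt wording, the `judge_conciseness` function, and especially `fake_judge`, which stands in for a real LLM call so that the example runs offline. The parts worth noting are the explicit criterion, the constrained answer format, and the defensive parsing of the judge's reply.

```python
from typing import Callable

JUDGE_PROMPT = """You are evaluating a chatbot response.
Criterion: the response is concise.
Response: {response}
Answer with exactly one word: CONCISE or VERBOSE."""

ALLOWED_LABELS = {"CONCISE", "VERBOSE"}

def judge_conciseness(response: str, call_llm: Callable[[str], str]) -> str:
    """Ask the judge model for a label and fall back to UNKNOWN on unexpected output."""
    prompt = JUDGE_PROMPT.format(response=response)
    label = call_llm(prompt).strip().upper()
    return label if label in ALLOWED_LABELS else "UNKNOWN"

def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM: labels by the length of the embedded response."""
    response = prompt.split("Response: ", 1)[1].split("\nAnswer with", 1)[0]
    return "concise" if len(response) < 80 else "verbose"

short_label = judge_conciseness("Leave requests go through the HR portal.", fake_judge)
long_label = judge_conciseness("Well, " + "it depends, " * 20 + "in short.", fake_judge)
print(short_label, long_label)
```

Replacing `fake_judge` with a function that calls an actual model keeps the rest of the code unchanged.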

TODO add info on pros and cons of LLM-as-a-judge

### Metadata summary

The QA dataset has a *feedback* column which includes user upvotes and downvotes. You can easily enrich your Report with summaries from any numerical or categorical columns.

Use ColumnSummaryMetric() to add a summary of the *feedback* column.

In [None]:
feedback_report = Report(metrics=[
    ColumnSummaryMetric(column_name="feedback"),
])

feedback_report.run(reference_data=None,
                    current_data=qa_logs[:200],
                    column_mapping=column_mapping)
feedback_report

## Test suite with evals for an LLM-based app

TODO base descr

So far, you've used Reports to summarize evaluation outcomes. However, to run automated tests (such as checking whether all texts fall within the expected length range) and review results only when something goes wrong, you need to set specific conditions for the metric values.

This is where you can use an alternative interface called **TestSuites**. TestSuites work similarly to Reports, but instead of listing metrics, you define tests and set conditions using parameters like gt (greater than), lt (less than), eq (equal), etc. 

**Define a Test Suite**

Add tests to check the following conditions:
* response length is always non-zero
* maximum response length does not exceed 1800 symbols (e.g., due to chat window constraints)
* mean response length is above 500 symbols (e.g., this is a known pattern)

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name=TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name=TextLength().on("response"), lte=1800),
    TestColumnValueMean(column_name=TextLength().on("response"), gt=500),
])

test_suite.run(reference_data=None,
               current_data=qa_logs[:200],
               column_mapping=column_mapping)
test_suite
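
The logic behind such conditions can be sketched in a few lines of plain Python. This is an illustration of the idea, not Evidently's implementation: `check_condition` is a hypothetical helper that mimics the gt, gte, lt, lte, and eq parameters, and the list of lengths is made up.

```python
from statistics import mean
from typing import Optional

def check_condition(value: float,
                    gt: Optional[float] = None,
                    gte: Optional[float] = None,
                    lt: Optional[float] = None,
                    lte: Optional[float] = None,
                    eq: Optional[float] = None) -> bool:
    """Return True when the value satisfies every condition that was provided."""
    return all([
        gt is None or value > gt,
        gte is None or value >= gte,
        lt is None or value < lt,
        lte is None or value <= lte,
        eq is None or value == eq,
    ])

# Hypothetical response lengths (in symbols).
lengths = [620, 480, 1750, 900]

results = {
    "min length > 0": check_condition(min(lengths), gt=0),
    "max length <= 1800": check_condition(max(lengths), lte=1800),
    "mean length > 500": check_condition(mean(lengths), gt=500),
}

for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```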

**Exercise 4**

Enrich the current test suite with a new condition: average response sentiment is positive.

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name = TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name = TextLength().on("response"), lte=1800),
    TestColumnValueMean(column_name = TextLength().on("response"), gt=500),
    ,
])

test_suite.run(reference_data=None, 
                    current_data=qa_logs[:200], 
                    column_mapping=column_mapping)
test_suite

**Custom Test Suite**

You can start by re-using the available test presets. Later, you can design a custom Test Suite by picking specific Tests and setting conditions more precisely. Here is how you can do it:

1. Choose individual Tests: select the Tests you want to include in your Test Suite.
2. Pass Test parameters: set custom parameters for applicable Tests.
3. Set custom conditions: define when Tests should pass or fail.
4. Mark Test criticality: mark non-critical Tests to give a Warning instead of a Fail.

More detailed information can be found [here](https://docs.evidentlyai.com/user-guide/tests-and-reports/run-tests#custom-test-suite).
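
The effect of marking a Test as non-critical can be sketched as follows. The `CheckOutcome` class and the `suite_passes` helper are hypothetical and written in plain Python; they only illustrate the rule that a failed non-critical check produces a warning and does not fail the suite as a whole.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CheckOutcome:
    name: str
    passed: bool
    is_critical: bool = True

    @property
    def status(self) -> str:
        """SUCCESS if passed; otherwise FAIL for critical and WARNING for non-critical checks."""
        if self.passed:
            return "SUCCESS"
        return "FAIL" if self.is_critical else "WARNING"

def suite_passes(outcomes: List[CheckOutcome]) -> bool:
    """The suite passes as long as no check ends in FAIL."""
    return all(outcome.status != "FAIL" for outcome in outcomes)

outcomes = [
    CheckOutcome("response length is non-zero", passed=True),
    CheckOutcome("mean response length above 500", passed=False, is_critical=False),
]

for outcome in outcomes:
    print(f"{outcome.status:8} {outcome.name}")
print("suite passes:", suite_passes(outcomes))
```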

**Test Suite usage**

* *regression testing*: run test suites whenever you modify any part of your LLM system, such as trying a new retrieval strategy, model version, or prompt. The goal is to check that updates don't make the quality of generative outputs worse or introduce new errors. You compare new responses with references or against a set of criteria.
* *continuous testing*: run test suites periodically over production logs to check that the output quality stays within expectations.

You can also set up alerts to get a notification if your Tests contain failures.

## Monitoring foundations for an LLM-based app

TODO base descr, info on tracing (spans), recommendations

## Concluding remarks

TODO create summary from all parts with main take-aways

**Further resources**

TODO add list of the resources

**Used materials**

TODO add list of used materials