## Performance evaluation: LLM-assisted QnA

**Overview**: In this example, we use UpTrain to evaluate the performance of a chat application used for natural language search against technical documentation.

**Why is monitoring needed**: LLMs used for retrieval can often produce unconstrained output. Even with custom prompts designed to coerce the output, the response can be irrelevant to the query. Further, for technical documentation, misleading responses can end up wasting more time than no response at all.

Monitoring NLP tasks in production with traditional metrics (such as accuracy) is hard, as ground truth is unavailable (or extremely delayed when there is a human in the loop).

**Problem**: The workflow of our hypothetical chat application is as follows:
- User enters a natural language query. 
- The query is converted to an embedding, and relevant sections from the documentation are retrieved using nearest neighbor search. 
- The original query and the retrieved sections are passed to a language model (LM), together with a custom prompt, to generate a response.
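
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the chatbot's actual code: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the search is brute force rather than an approximate index, and the prompt template is invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, sections: list[str], k: int = 2) -> list[str]:
    """Brute-force nearest neighbor search over the documentation sections."""
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved sections and the query (made-up template)."""
    joined = "\n---\n".join(context)
    return f"Answer using only the documentation below.\n{joined}\n\nQuestion: {query}"


# Made-up documentation snippets, for illustration only.
sections = [
    "st.button displays a button widget",
    "st.cache_data caches the output of a function",
    "st.slider displays a slider widget",
]
query = "how do I display a button"
top = retrieve(query, sections, k=1)
prompt = build_prompt(query, top)  # this string would be sent to the LM
```

In a real deployment the embedding model, the vector index, and the prompt would all be replaced by production components; the point here is only the shape of the data that ends up in the logs (query, retrieved text, response).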

We use a dataset built from logs generated by a chatbot made to answer questions from the [Streamlit user documentation](https://docs.streamlit.io/). 

**Solution**: We illustrate how to use the UpTrain Evals framework to assess the performance of the chatbot.

## Install required packages

```bash
pip install "uptrain[full]"  # Install UpTrain with all dependencies (quoted so shells like zsh do not expand the brackets)
```

In [2]:
import os
import polars as pl

SCRATCH_SPACE = "/tmp/uptrain-scratch/"
os.makedirs(SCRATCH_SPACE, exist_ok=True)

UpTrain primitives share configuration, such as OpenAI API keys or the logs folder to write to, through a single `Settings` object.

In [3]:
from uptrain.framework import Settings

UPTRAIN_LOGS_DIR = os.path.join(SCRATCH_SPACE, "logs")
UPTRAIN_SETTINGS = Settings(logs_folder=UPTRAIN_LOGS_DIR)

In [4]:
# Download the dataset if it is not already present
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join(SCRATCH_SPACE, "qna-notebook-data.jsonl")

if not os.path.exists(dataset_path):
    import httpx

    r = httpx.get(url)
    r.raise_for_status()  # fail early instead of saving an error page to disk
    with open(dataset_path, "wb") as f:
        f.write(r.content)

#### Explore the dataset

In [5]:
dataset = pl.read_ndjson(dataset_path)

Reading the file prints a schema dump, which is truncated here. The columns visible in it are `question`, `document_title`, `document_link`, `document_text`, `answer` (all strings) and `question_idx` (integer).

This dataset has multiple log entries corresponding to a single user query. 
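
To see how many log entries belong to each query, the records can be counted per query identifier. The sketch below uses only the standard library and assumes that `question_idx` identifies the query, as its use as a grouping column later in this example suggests; the sample records are made up.

```python
import json
from collections import Counter


def entries_per_query(lines, key="question_idx"):
    """Count how many JSONL records share each value of `key`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record[key]] += 1
    return counts


# Made-up sample records. With the real file, pass an open file handle:
#   with open(dataset_path) as f:
#       counts = entries_per_query(f)
sample = [
    '{"question_idx": 0, "question": "q0", "answer": "a"}',
    '{"question_idx": 0, "question": "q0", "answer": "b"}',
    '{"question_idx": 1, "question": "q1", "answer": "c"}',
]
counts = entries_per_query(sample)
```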

### Evaluations

We can do a few different checks to see if the chatbot is working as expected. 

**Response Document similarity**

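We want to make sure the LLM responses stay close to the document provided as context and that the model is not hallucinating technical terms. A quick proxy for this is to measure how much of the response overlaps with the document: the cell below uses ROUGE precision as a hallucination score, and also computes the cosine similarity between the question and context embeddings.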
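
One way to make "close to the document" concrete is a token-overlap precision, in the spirit of ROUGE-1 precision: the fraction of response tokens that also appear in the document. The function below is a simplified sketch with whitespace tokenization; UpTrain's `RougeScore` operator may tokenize and score differently, and the example strings are invented.

```python
from collections import Counter


def unigram_precision(response: str, document: str) -> float:
    """Fraction of response tokens that also occur in the document (clipped counts)."""
    resp = Counter(response.lower().split())
    doc = Counter(document.lower().split())
    total = sum(resp.values())
    if total == 0:
        return 0.0
    # Each response token counts at most as often as it appears in the document.
    overlap = sum(min(count, doc[token]) for token, count in resp.items())
    return overlap / total


document = "use st.cache_data to cache the output of a function"
grounded = unigram_precision("st.cache_data caches the output", document)
invented = unigram_precision("call st.memoize_forever instead", document)
```

A response that reuses the document's vocabulary scores high, while one that introduces terms absent from the document (such as the made-up `st.memoize_forever`) scores low.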

UpTrain operators are implemented as regular dataset transformations that take a polars DataFrame as input and produce another one. This makes it easy to integrate them into existing workflows.

In [12]:
from uptrain.operators import SelectOp, CosineSimilarity
from uptrain.operators.language import RougeScore

op_1 = SelectOp(
    columns={
        "hallucination-score": RougeScore(
            score_type="precision",
            col_in_generated="response", 
            col_in_source="document_text"
        ),
        "similarity-question-context": CosineSimilarity(
            col_in_vector_1="question_embeddings",
            col_in_vector_2="context_embeddings",
        ),
    }
)

out_df = op_1.setup(UPTRAIN_SETTINGS).run(dataset)["output"]

We could visualize the distribution of hallucination scores by doing a quick histogram plot.

In [13]:
import plotly.express as px

px.histogram(out_df.to_pandas(), x="hallucination-score")

In [15]:
import plotly.express as px

px.histogram(out_df.to_pandas(), x="similarity-question-context")

**How similar are the documents retrieved for a query?**

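We can assess how well our approximate nearest neighbor (ANN) retrieval is working by looking at the distribution of pairwise similarity between the documents retrieved for the same query. The cell below uses `kind="rouge"`, so document pairs are scored with ROUGE on their text rather than with cosine similarity.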
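
The idea of comparing all documents retrieved for one query can be sketched without UpTrain. The code below groups rows by `question_idx` only (the cell that follows also groups by `experiment_id`) and scores each pair with a simple set-based token F1, which is an assumption made for illustration and not the score the `Distribution` operator computes. The rows are made up.

```python
from collections import defaultdict
from itertools import combinations


def token_f1(a: str, b: str) -> float:
    """Symmetric unigram-overlap F1 between two texts (token sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def pairwise_scores(rows, group_key="question_idx", text_key="document_text"):
    """Score every pair of documents retrieved for the same query."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[text_key])
    return {
        key: [token_f1(a, b) for a, b in combinations(texts, 2)]
        for key, texts in groups.items()
    }


rows = [
    {"question_idx": 0, "document_text": "st.button displays a button"},
    {"question_idx": 0, "document_text": "st.button displays a button"},
    {"question_idx": 1, "document_text": "caching with st.cache_data"},
    {"question_idx": 1, "document_text": "layout with st.columns"},
]
scores = pairwise_scores(rows)
```

Scores near 1 indicate that the retriever returned near-duplicate documents for a query; scores near 0 indicate that the retrieved documents have little in common.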

TODO: Does a dispersion mean better retrieval?

In [8]:
from uptrain.operators import Distribution

op_2 = Distribution(
    kind="rouge",
    col_in_embs=["document_text"],
    col_in_groupby=["question_idx", "experiment_id"],
    col_out=["rogue_f1"],
)

out_df = op_2.setup(UPTRAIN_SETTINGS).run(dataset)["output"]

In [9]:
px.histogram(out_df.to_pandas(), x="rogue_f1", nbins=20)

### Running multiple evaluations together

While you could use UpTrain to run evaluations one at a time, it is often useful to include them in your automated testing workflow. UpTrain provides the `CheckSet` abstraction to make this easier: you list the checks you want to run, and UpTrain runs them all together on the defined schedule and provides a dashboard to analyze the results.

In [10]:
from uptrain.framework import CheckSet, SimpleCheck
from uptrain.io import JsonReader
from uptrain.operators import PlotlyChart

list_checks = [
    SimpleCheck(name="check_1", sequence=[op_1], plot=[
        PlotlyChart.Histogram(x="hallucination-score", nbins=20),
        PlotlyChart.Histogram(x="similarity-question-context", nbins=20),
    ]),
    SimpleCheck(name="check_2", sequence=[op_2], plot=[
        PlotlyChart.Histogram(x="rogue_f1", nbins=20),
    ])    
]

check_set = CheckSet(
    source=JsonReader(fpath=dataset_path),
    checks=list_checks,
    settings=Settings(logs_folder=UPTRAIN_LOGS_DIR),
)

In [11]:
check_set.setup().run()

2023-06-29 10:40:53.649 | DEBUG | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: check_1

In [None]:
# Once the check set has run, start a Streamlit dashboard against the logs folder to see the results

from uptrain.dashboard import StreamlitRunner

runner = StreamlitRunner(UPTRAIN_LOGS_DIR)  # the logs folder defined earlier
runner.start()
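
For the automated-testing use mentioned earlier, a complementary pattern is a plain assertion on summary statistics of a score column. The sketch below is independent of UpTrain: the function names, the threshold of 0.5, and the allowed fraction of 0.2 are hypothetical values chosen for illustration, and the score lists are made up.

```python
def fraction_below(scores, threshold):
    """Share of scores that fall strictly below `threshold`."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s < threshold for s in scores) / len(scores)


def check_scores(scores, threshold=0.5, max_fraction=0.2):
    """Return True when at most `max_fraction` of the scores are below `threshold`."""
    return fraction_below(scores, threshold) <= max_fraction


# Made-up scores standing in for a column such as "hallucination-score".
healthy = [0.9, 0.8, 0.7, 0.95, 0.4, 0.85, 0.75, 0.9, 0.8, 0.7]
degraded = [0.9, 0.3, 0.2, 0.6, 0.1]

healthy_ok = check_scores(healthy)    # 1 of 10 scores is below 0.5
degraded_ok = check_scores(degraded)  # 3 of 5 scores are below 0.5
```

A test runner could assert on the result of `check_scores` so that a drop in response quality fails the build; suitable thresholds depend on the application and would need to be calibrated on real logs.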