## How to use UpTrain to validate LLM responses

**Overview**: In this example, we validate LLM responses with UpTrain before passing them to downstream tasks. Validation is performed by a list of defined checks. If a generated output fails validation, UpTrain retries the LLM until the output passes. We use a QnA task as the running example.

**Why is validation needed**: LLMs are powerful, but they are not 100% reliable. Downstream tasks often require the LLM response in a particular structure, and a response that deviates from it can break those tasks. LLMs can also hallucinate, and we don't want to show such results to our users. Hence, we run validation checks on the LLM responses, catch the cases where they go wrong, and retry the LLM. This process is repeated until the output passes all the validation checks.

**Task Setup**: The workflow of our hypothetical QnA application is as follows:
- User enters a natural language query. 
- The query is converted to an embedding, and relevant sections from the documentation are retrieved using nearest neighbor search. 
- The original query along with the retrieved sections is passed to a language model (LM), along with a custom prompt to generate a response. 
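
The retrieval step in the workflow above can be sketched in a few lines. This is a minimal illustration, not the chatbot's actual implementation: the toy `embed` function (a word-count vector over a fixed vocabulary), the sample sections and the helper names are all assumptions made for the example. A real system would use a learned embedding model and a vector index.

```python
import math

# Hypothetical toy vocabulary and documents, for illustration only.
VOCAB = ["session", "state", "histogram", "chart", "cache"]
SECTIONS = [
    "Session state lets you persist values across reruns of a session",
    "Draw a histogram chart with a plotting library and display the chart",
    "Use the cache to avoid recomputing expensive functions",
]

def embed(text):
    """Toy embedding: count how often each vocabulary word occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, sections, k=1):
    """Return the k sections whose embeddings are closest to the query."""
    query_vec = embed(query)
    ranked = sorted(
        sections,
        key=lambda s: cosine_similarity(query_vec, embed(s)),
        reverse=True,
    )
    return ranked[:k]

top_sections = retrieve("how do I plot a histogram", SECTIONS)
```

The retrieved sections would then be placed into the prompt together with the original query.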

We use a dataset built from logs generated by a chatbot made to answer questions from the [Streamlit user documentation](https://docs.streamlit.io/). 

**Solution**: We illustrate how to use the UpTrain validation framework to validate the performance of the chatbot.

## Install required packages

```bash
pip install "uptrain[full]"  # Install UpTrain with all dependencies; the quotes keep shells like zsh from expanding the brackets
```

#### Make sure to define openai_api_key

In [6]:
import os
os.environ['OPENAI_API_KEY'] = "..."
import openai
import polars as pl
import json

# Let's first define our prompt and model

We have designed a prompt template to take in a question and a document and extract the relevant sections from it.

In [2]:
prompt_template = """
    You are a developer assistant that can only quote text from documents. 
    You will be given a section of technical documentation titled {document_title}.
    
    The input is: '{question}?'. 

    Your task is to answer the question by quoting exactly all sections of the document that are relevant to any topics of the input. 
    Copy the text exactly as found in the original document. 
    
    Okay, here is the document:
    --- START: Document ---
    
    {document_text}

    --- END: Document ---
    Now do the task. If there are no relevant sections, just respond with \"<EMPTY MESSAGE>\".
    
    Here is the answer:
"""
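
To see what the model actually receives, it helps to render the template once by hand. The sketch below uses a shortened stand-in template and made-up field values (both are assumptions for illustration). `str.format` fills each `{placeholder}` from the keyword arguments, which is why the dataset columns must carry the same names as the placeholders.

```python
# Shortened stand-in for the full prompt template (illustrative only).
demo_template = (
    "You will be given a section of technical documentation titled {document_title}.\n"
    "The input is: '{question}?'.\n"
    "--- START: Document ---\n"
    "{document_text}\n"
    "--- END: Document ---\n"
)

# Hypothetical row; the keys match the placeholder names in the template.
demo_row = {
    "question": "How do I cache data",
    "document_title": "Caching",
    "document_text": "Use a cache decorator to store results.",
}

rendered_prompt = demo_template.format(**demo_row)
print(rendered_prompt)

# A missing key raises KeyError, so a malformed row fails before reaching the LLM.
try:
    demo_template.format(question="only a question")
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
```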


Let's now load our dataset and see what it looks like.

In [3]:
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join("datasets", "qna-notebook-data.jsonl")

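if not os.path.exists(dataset_path):
    import httpx
    # Create the target directory first; open() fails if it does not exist
    os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
    r = httpx.get(url)
    r.raise_for_status()  # Stop early on a failed download
    with open(dataset_path, "wb") as f:
        f.write(r.content)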

dataset = pl.read_ndjson(dataset_path).select(pl.col(['question', 'document_title', 'document_text']))
print(dataset[0:2])

shape: (2, 3)
┌─────────────────────────────┬──────────────────────────────┬─────────────────────────────────────┐
│ question                    ┆ document_title               ┆ document_text                       │
│ ---                         ┆ ---                          ┆ ---                                 │
│ str                         ┆ str                          ┆ str                                 │
╞═════════════════════════════╪══════════════════════════════╪═════════════════════════════════════╡
│ How to use the sessionstate ┆ What is serializable session ┆ ## Serializable Session State\n\…   │
│ feat…                       ┆ sta…                         ┆                                     │
│ How can I create histograms ┆ API reference                ┆ ader(\"Define a custom colorscal…   │
│ with…                       ┆                              ┆                                     │
└─────────────────────────────┴──────────────────────────────┴─────────────────────────────────────┘

Let's now get responses from our LLM by defining a completion function. We use GPT-3.5-Turbo for this.

In [4]:
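def get_model_response(input_dictn):
    # Fill the template placeholders from one dataset row
    prompt = [{"role": "system", "content": prompt_template.format(**input_dictn)}]
    # Note: openai.ChatCompletion is the legacy interface (openai<1.0)
    response = openai.ChatCompletion.create(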
        model="gpt-3.5-turbo",
        messages=prompt,
        temperature=0.1
    )
    message = response.choices[0]['message']['content']
    return message

Now that the setup is complete, let's try out a few examples to see how this looks.

In [11]:
print(json.dumps({'input_question': dataset['question'][0], 'llm_response': get_model_response(dataset.to_dicts()[0])}, indent=1), "\n")
print(json.dumps({'input_question': dataset['question'][1], 'llm_response': get_model_response(dataset.to_dicts()[1])}, indent=1), "\n")
print(json.dumps({'input_question': dataset['question'][5], 'llm_response': get_model_response(dataset.to_dicts()[5])}, indent=1), "\n")

{
 "input_question": "How to use the sessionstate feature in Streamlit",
 "llm_response": "By default, Streamlit\u2019s [Session State](https://docs.streamlit.io/library/advanced-features/session-state) allows you to persist any Python object for the duration of the session, irrespective of the object\u2019s pickle-serializability. This property lets you store Python primitives such as integers, floating-point numbers, complex numbers and booleans, dataframes, and even [lambdas](https://docs.python.org/3/reference/expressions.html#lambda) returned by functions. However, some execution environments may require serializing all data in Session State, so it may be useful to detect incompatibility during development, or when the execution environment will stop supporting it in the future.\n\nTo that end, Streamlit provides a `runner.enforceSerializableSessionState` [configuration option](https://docs.streamlit.io/library/advanced-features/configuration) that, when set to `true`, only allows

As our prompt instructs, the model returns an empty response (`<EMPTY MESSAGE>`) in certain cases. Let's see how we can use the UpTrain validation framework to check for this and retry the LLM whenever it happens.

# Using Validation Framework to check for empty responses

Let's define a simple check (an UpTrain construct) to evaluate whether the model response is empty. We use the pre-built TextComparison operator for this. After running it on our input data, a new column called 'is_empty_response' is created.

In [12]:
from uptrain.framework import Check, Signal
from uptrain.operators import (
    SelectOp,
)
from uptrain.operators.language import (
    TextComparison,
)
from validation_wrapper import ValidationManager

check = Check(
    name="empty_response_validation",
    sequence=[
        SelectOp(
            columns={
                "is_empty_response": TextComparison(
                    reference_text="<EMPTY MESSAGE>",
                    col_in_text="response",
                ),
            }
        )
    ],
)
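
Conceptually, the check adds one boolean column to the table of responses. The plain-Python sketch below mimics that behaviour under the assumption that the comparison is an exact string match after trimming whitespace. The function name and that matching rule are illustrative and may differ from what `TextComparison` does internally.

```python
EMPTY_MARKER = "<EMPTY MESSAGE>"

def add_is_empty_response(rows, reference_text=EMPTY_MARKER, col_in_text="response"):
    """Return copies of the rows with an added 'is_empty_response' flag.

    Assumption: a response counts as empty when it equals the reference
    text exactly, ignoring leading and trailing whitespace.
    """
    return [
        {**row, "is_empty_response": row[col_in_text].strip() == reference_text}
        for row in rows
    ]

sample_rows = [
    {"response": "<EMPTY MESSAGE>"},
    {"response": "Session State lets you persist values."},
    {"response": "  <EMPTY MESSAGE>\n"},
]
checked_rows = add_is_empty_response(sample_rows)
flags = [row["is_empty_response"] for row in checked_rows]
```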



Our pass condition is defined as "any response that is not empty". UpTrain provides a wrapper called Signal, which lets us define the pass condition by combining check outputs with operators (like ~, &, |, +, etc.).


In [13]:
pass_condition = ~Signal('is_empty_response')
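
Operators such as `~` can work on `Signal` because Python lets a class overload them. The sketch below is a hypothetical, much-simplified stand-in (the class `ToySignal` and its `evaluate` method are invented for illustration and are not UpTrain's implementation). It shows how `~`, `&` and `|` can build a condition that is evaluated later against a row of check outputs.

```python
class ToySignal:
    """Hypothetical stand-in for a composable condition on check outputs."""

    def __init__(self, name=None, fn=None):
        # A leaf signal reads a column; a composed signal wraps a function.
        self._fn = fn if fn is not None else (lambda row: bool(row[name]))

    def evaluate(self, row):
        return self._fn(row)

    def __invert__(self):  # ~signal
        return ToySignal(fn=lambda row: not self.evaluate(row))

    def __and__(self, other):  # signal & other
        return ToySignal(fn=lambda row: self.evaluate(row) and other.evaluate(row))

    def __or__(self, other):  # signal | other
        return ToySignal(fn=lambda row: self.evaluate(row) or other.evaluate(row))

toy_condition = ~ToySignal("is_empty_response")
passes_when_filled = toy_condition.evaluate({"is_empty_response": False})
passes_when_empty = toy_condition.evaluate({"is_empty_response": True})
```

Building the condition as an object, rather than evaluating it immediately, lets the same condition be reused for every response.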

UpTrain provides a validation manager class, which takes the check, the completion function and the pass condition. Instead of calling the completion function directly, we call the validation manager. Under the hood, it runs the check, verifies the pass condition and, if it fails, retries until the LLM outputs a valid response.

In [14]:
validation_manager = ValidationManager(
    check=check,
    completion_fn=get_model_response,
    pass_condition=pass_condition
)
validation_manager.setup()
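
The `validation_wrapper` module is not shown here, so the sketch below is only a guess at the control flow described above: call the completion function, run a check on the response, and retry until the pass condition holds. The class `ToyValidationManager`, its `max_retries` cap and the scripted fake model are assumptions for illustration. The cap is added so that a model which keeps failing cannot loop forever.

```python
class ToyValidationManager:
    """Hypothetical retry wrapper; not the real ValidationManager."""

    def __init__(self, check_fn, completion_fn, pass_condition, max_retries=3):
        self.check_fn = check_fn              # response -> dict of check outputs
        self.completion_fn = completion_fn    # inputs -> response text
        self.pass_condition = pass_condition  # check outputs -> bool
        self.max_retries = max_retries

    def run(self, inputs):
        for _ in range(self.max_retries):
            response = self.completion_fn(inputs)
            if self.pass_condition(self.check_fn(response)):
                return response
        raise RuntimeError(f"No valid response after {self.max_retries} attempts")

# Fake model that returns the empty marker first, then a real answer.
scripted_replies = iter(["<EMPTY MESSAGE>", "Use st.session_state to persist values."])

def fake_completion(inputs):
    return next(scripted_replies)

manager = ToyValidationManager(
    check_fn=lambda response: {"is_empty_response": response == "<EMPTY MESSAGE>"},
    completion_fn=fake_completion,
    pass_condition=lambda outputs: not outputs["is_empty_response"],
)
toy_validated = manager.run({"question": "How to use session state"})
```

In this run the first reply fails the check, so the wrapper calls the fake model a second time and returns the non-empty answer.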

Finally, let's run it on our input dataset.

In [17]:
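validated_responses = []
for inputs in dataset.to_dicts():
    # Keep every validated response instead of overwriting the previous one
    validated_responses.append(validation_manager.run(inputs))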

2023-06-30 14:15:02.901 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-30 14:15:03.411 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-30 14:15:03.913 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-30 14:15:04.434 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-30 14:15:04.947 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_