## How to use UpTrain to validate LLM responses

**Overview**: This example shows how to use UpTrain to check that your LLM responses are adequate before you use them in downstream tasks. Validation is performed by a list of defined checks. If the LLM's response is invalid, UpTrain retries until the model returns a valid one. We use a Q&A task as the running example.

**Why is validation needed**: LLMs are powerful, but they are not 100% reliable. Downstream tasks often require the LLM response in a particular structure, and the response sometimes deviates from that format, which breaks whatever consumes it. LLMs can also hallucinate, and we don't want to show those results to our users. We therefore run validation checks on the LLM responses, catch the cases where they go wrong, and retry the LLM. This process repeats until the output passes all the validation checks.

**Problem**: The workflow of our hypothetical Q&A application is as follows:
- The user enters a question.
- The query is converted to an embedding, and relevant sections of the documentation are retrieved using nearest-neighbour search.
- The original query and the retrieved sections are passed to a language model (LM), along with a custom prompt to generate a response. 
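
The retrieval step in the list above can be pictured with a small sketch. It assumes cosine similarity as the distance measure and uses made-up three-dimensional embeddings; the helper names `cosine_similarity` and `retrieve` are hypothetical and are not part of UpTrain or of the application described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, sections, k=1):
    """Return the k sections whose embeddings are most similar to the query."""
    ranked = sorted(
        sections,
        key=lambda s: cosine_similarity(query_embedding, s["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, invented for illustration only
sections = [
    {"title": "Session State", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Charts", "embedding": [0.0, 0.2, 0.9]},
    {"title": "Caching", "embedding": [0.5, 0.5, 0.1]},
]
query_embedding = [1.0, 0.0, 0.0]

top = retrieve(query_embedding, sections, k=2)
print([s["title"] for s in top])
```

A real application would obtain the embeddings from an embedding model and would typically use a vector index rather than sorting every section.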

**Solution**: We will illustrate how to use the UpTrain Validation Framework to validate the performance of the chatbot. We will use a dataset built from logs generated by a chatbot made to answer questions from the [Streamlit user documentation](https://docs.streamlit.io/).

## Install required packages

```bash
pip install uptrain[full]  # Install UpTrain with all dependencies
```

#### Make sure to set `OPENAI_API_KEY`

In [11]:
import os
import openai
import polars as pl
import json


**This notebook uses the OpenAI API to generate text for prompts. Make sure the `OPENAI_API_KEY` environment variable is populated with your API key.**

In [12]:
# os.environ["OPENAI_API_KEY"] = "..."

## Define the prompt and model

We have designed a prompt template to take in a question and a document and extract the relevant sections from it.

In [13]:
prompt_template = """
    You are a developer assistant that can only quote text from documents. 
    You will be given a section of technical documentation titled {document_title}.
    
    The input is: '{question}?'. 

    Your task is to answer the question by quoting exactly all sections of the document that are relevant to any topics of the input. 
    Copy the text exactly as found in the original document. 
    
    Okay, here is the document:
    --- START: Document ---
    
    {document_text}

    --- END: Document ---
    Now do the task. If there are no relevant sections, just respond with \"<EMPTY MESSAGE>\".
    
    Here is the answer:
"""


Let's now load our dataset and take a look at it.

In [14]:
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join("datasets", "qna-notebook-data.jsonl")

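if not os.path.exists(dataset_path):
    import httpx

    # Create the target directory first; open() fails if it does not exist
    os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
    r = httpx.get(url)
    r.raise_for_status()  # stop here instead of saving an error response
    with open(dataset_path, "wb") as f:
        f.write(r.content)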

dataset = pl.read_ndjson(dataset_path).select(
    pl.col(["question", "document_title", "document_text"])
)
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])


Number of test cases:  90
Couple of samples:  shape: (2, 3)
┌─────────────────────────────┬──────────────────────────────┬─────────────────────────────────────┐
│ question                    ┆ document_title               ┆ document_text                       │
│ ---                         ┆ ---                          ┆ ---                                 │
│ str                         ┆ str                          ┆ str                                 │
╞═════════════════════════════╪══════════════════════════════╪═════════════════════════════════════╡
│ How to use the sessionstate ┆ What is serializable session ┆ ## Serializable Session State\n\…   │
│ feat…                       ┆ sta…                         ┆                                     │
│ How can I create histograms ┆ API reference                ┆ ader(\"Define a custom colorscal…   │
│ with…                       ┆                              ┆                                     │
└─────────────────────────────┴──────────────────────────────┴─────────────────────────────────────┘

Next, we define a completion function that gets responses from our LLM.
We use GPT-3.5-Turbo here.

In [5]:
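def get_model_response(input_dict):
    # Fill the template with the row's question, document_title and document_text
    prompt = [{"role": "system", "content": prompt_template.format(**input_dict)}]
    # Note: openai.ChatCompletion is the pre-1.0 interface of the openai package
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=prompt, temperature=0.1
    )
    # Keep only the text of the first (and only) returned choice
    message = response.choices[0]["message"]["content"]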
    return message
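
The completion function relies on `str.format(**input_dict)`, so every placeholder in the template must have a matching key in the input row. The sketch below shows that behaviour in isolation, without calling the API. `short_template` and `row` are shortened, made-up stand-ins for the notebook's template and data.

```python
# Shortened stand-in for the prompt template (hypothetical text)
short_template = (
    "Document: {document_title}\n"
    "The input is: '{question}?'\n"
    "{document_text}"
)

row = {
    "question": "How do I cache data",
    "document_title": "Caching",
    "document_text": "Example section text.",
    "source": "extra keys are simply ignored by format()",
}

# Same message structure as in get_model_response, but no API call is made
prompt = short_template.format(**row)
messages = [{"role": "system", "content": prompt}]
print(messages[0]["content"])

# A missing key raises KeyError naming the first unfilled placeholder
try:
    short_template.format(question="only the question")
except KeyError as err:
    missing = err.args[0]
    print("missing placeholder:", missing)
```

Note that the template appends `?` after `{question}`, so a question that already ends with a question mark will appear with two.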


Now that we have completed the setup, let's try out a few examples to see how they look.

In [6]:
rows = dataset.to_dicts()  # convert once instead of once per example

# Print the question and the model's response for three sample rows
for idx in [0, 1, 5]:
    print(
        json.dumps(
            {
                "input_question": rows[idx]["question"],
                "llm_response": get_model_response(rows[idx]),
            },
            indent=1,
        ),
        "\n",
    )


{
 "input_question": "How to use the sessionstate feature in Streamlit",
 "llm_response": "By default, Streamlit\u2019s [Session State](https://docs.streamlit.io/library/advanced-features/session-state) allows you to persist any Python object for the duration of the session, irrespective of the object\u2019s pickle-serializability. This property lets you store Python primitives such as integers, floating-point numbers, complex numbers and booleans, dataframes, and even [lambdas](https://docs.python.org/3/reference/expressions.html#lambda) returned by functions. However, some execution environments may require serializing all data in Session State, so it may be useful to detect incompatibility during development, or when the execution environment will stop supporting it in the future.\n\nTo that end, Streamlit provides a `runner.enforceSerializableSessionState` [configuration option](https://docs.streamlit.io/library/advanced-features/configuration) that, when set to `true`, only allows

As we can see, our model returns an empty response (`<EMPTY MESSAGE>`) in certain cases. Let's see how we can use the UpTrain Validation Framework to detect these cases and retry the LLM whenever they occur.

## Using the Validation Framework to check for empty responses

Let's define a `Check` to evaluate whether the model response is empty. We use the pre-built `TextComparison` operator, which compares the `response` column against the reference text `<EMPTY MESSAGE>`. Running the check on our input data creates a new column called `is_empty_response`.

In [7]:
from uptrain.framework import Check
from uptrain.operators import TextComparison

check = Check(
    name="empty_response_validation",
    sequence=[
        TextComparison(
            reference_text="<EMPTY MESSAGE>",
            col_in_text="response",
            col_out="is_empty_response",
        ),
    ],
)
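
As a rough plain-Python picture of what this check computes, the sketch below assumes the operator performs an exact string match against `reference_text` and writes a truthy flag to the output column. The real operator runs inside UpTrain and may encode the flag differently; the function `add_is_empty_response` and the sample rows are made up for illustration.

```python
EMPTY_MARKER = "<EMPTY MESSAGE>"


def add_is_empty_response(rows, reference_text=EMPTY_MARKER):
    """Return copies of the rows with an added 'is_empty_response' flag.

    Assumption: a response counts as empty only if it equals the
    reference text exactly.
    """
    return [
        {**row, "is_empty_response": row["response"] == reference_text}
        for row in rows
    ]


# Invented sample responses
rows = [
    {"response": "Session State lets you persist objects."},
    {"response": "<EMPTY MESSAGE>"},
]

checked = add_is_empty_response(rows)
flags = [row["is_empty_response"] for row in checked]
print(flags)
```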


Our pass condition is "any response that is not empty". UpTrain provides a wrapper called `Signal` that lets us define the pass condition by combining columns with operators such as `~`, `&`, `|` and `+`.


In [8]:
from uptrain.framework import Signal

pass_condition = ~Signal("is_empty_response")
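
To see how an expression like `~Signal("is_empty_response")` can act as a condition that is evaluated later, here is a minimal sketch built on Python operator overloading. The `Condition` class is hypothetical and is not UpTrain's implementation; it only mimics the `~`, `&` and `|` behaviour on plain dictionaries.

```python
class Condition:
    """Hypothetical stand-in for a signal: wraps a function from a row to bool."""

    def __init__(self, fn):
        self.fn = fn

    @classmethod
    def column(cls, name):
        # Read a flag from the named key of a row (a plain dict here)
        return cls(lambda row: bool(row[name]))

    def __invert__(self):  # ~cond
        return Condition(lambda row: not self.fn(row))

    def __and__(self, other):  # cond_a & cond_b
        return Condition(lambda row: self.fn(row) and other.fn(row))

    def __or__(self, other):  # cond_a | cond_b
        return Condition(lambda row: self.fn(row) or other.fn(row))

    def __call__(self, row):
        # Evaluate the composed condition on one row
        return self.fn(row)


# Nothing is evaluated when the condition is built, only when it is called
not_empty = ~Condition.column("is_empty_response")

passed_for_answer = not_empty({"is_empty_response": False})
passed_for_empty = not_empty({"is_empty_response": True})
print(passed_for_answer, passed_for_empty)
```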


UpTrain provides a `ValidationManager` class that takes the `Check`, the `completion_function` and the `pass_condition`. Instead of calling the completion function directly, we call `validation_manager`. Under the hood, it gets a response, runs the check on it, and evaluates the pass condition. If the condition is not met, it retries until the LLM returns a response that passes.

In [9]:
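# `validation_wrapper` is imported as a top-level module, so it must be on
# sys.path (for example, in the notebook's working directory)
from validation_wrapper import ValidationManager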

validation_manager = ValidationManager(
    check=check,
    completion_function=get_model_response,
    pass_condition=pass_condition,
)
validation_manager.setup()
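
Conceptually, the manager wraps the completion function in a validate-and-retry loop. The sketch below is a simplified stand-in and not UpTrain's code: `run_with_validation`, `is_valid` and the `max_attempts` cap are assumptions added for illustration (the description above mentions retrying until the response is valid and does not mention a cap).

```python
def run_with_validation(completion_function, is_valid, inputs, max_attempts=5):
    """Call completion_function until is_valid(response) holds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = completion_function(inputs)
        if is_valid(response):
            return response, attempt
    raise RuntimeError(f"No valid response after {max_attempts} attempts")


# Fake model (hypothetical): returns the empty marker twice, then an answer
canned = iter(["<EMPTY MESSAGE>", "<EMPTY MESSAGE>", "quoted section text"])


def fake_completion(inputs):
    return next(canned)


response, attempts = run_with_validation(
    fake_completion,
    is_valid=lambda text: text != "<EMPTY MESSAGE>",
    inputs={"question": "How to use session state"},
)
print(f"passed after {attempts} attempt(s): {response}")
```

Capping the number of attempts is a design choice in this sketch: with a real API, an uncapped loop can run up cost if the model never produces a valid response for a given input.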


Finally, let's run it on the first 20 rows of our input dataset.

In [10]:
for inputs in dataset.to_dicts()[:20]:
    validated_response = validation_manager.run(inputs)


2023-07-05 15:21:39.662 | DEBUG    | uptrain.framework.base:run:119 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-07-05 15:21:39.854 | SUCCESS  | validation_wrapper.validation_manager:run:33 - Validation check PASSED after 1 attempt(s)
2023-07-05 15:21:40.481 | DEBUG    | uptrain.framework.base:run:119 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-07-05 15:21:41.506 | DEBUG    | uptrain.framework.base:run:119 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-07-05 15:21:42.607 | DEBUG    | uptrain.framework.base:run:119 - Executing node: sequence_0 for operator DAG: empty_response_validation