## How to use UpTrain to validate LLM responses

**Overview**: In this example, we show how to validate your LLM responses before passing them to downstream tasks. Validation runs a list of checks that you define, and UpTrain retries the LLM until it returns a valid response. We use a QnA task to illustrate this.

**Why is validation needed**: LLMs are great, but they can also go horribly wrong. Your downstream tasks expect the LLM response in a certain structure, and in some cases the response does not follow it. LLMs can also hallucinate, and you definitely don't want to show those results to your users. Hence, you want to run validation checks on your LLM responses, retry when they fail, and output only the final responses that pass all the checks.

**Task Setup**: The workflow of our hypothetical QnA application is as follows:
- User enters a natural language query. 
- The query is converted to an embedding, and relevant sections from the documentation are retrieved using nearest neighbor search. 
- The original query along with the retrieved sections is passed to a language model (LM), along with a custom prompt to generate a response. 
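
The retrieval step above can be sketched in a few lines. This is a minimal illustration only: the three-dimensional embeddings, the section texts, and the helper names `cosine_similarity` and `retrieve` are made up for the example, and a real application would use an embedding model and a vector index instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, sections, top_k=1):
    """Return the top_k sections whose embeddings are closest to the query."""
    ranked = sorted(
        sections,
        key=lambda s: cosine_similarity(query_embedding, s["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: hand-written embeddings, purely for illustration.
toy_sections = [
    {"title": "Session State", "text": "Toy text about session state.", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Charts", "text": "Toy text about charts.", "embedding": [0.1, 0.9, 0.2]},
]
toy_query_embedding = [0.8, 0.2, 0.1]

best = retrieve(toy_query_embedding, toy_sections)[0]

# The query and the retrieved section together form the inputs for the prompt.
prompt_inputs = {
    "question": "How to use the sessionstate feature in Streamlit",
    "document_title": best["title"],
    "document_text": best["text"],
}
print(prompt_inputs["document_title"])
```

The resulting dictionary has the same three keys that the prompt template later in this example expects.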

We use a dataset built from logs generated by a chatbot made to answer questions from the [Streamlit user documentation](https://docs.streamlit.io/). 

**Solution**: We illustrate how to use the UpTrain Validation framework to validate the performance of the chatbot. 

## Install required packages

```bash
pip install uptrain[full]  # Install UpTrain with all dependencies
```

#### Make sure to define openai_api_key

In [1]:
import os

# Set your OpenAI API key here, or export OPENAI_API_KEY before starting the notebook.
os.environ['OPENAI_API_KEY'] = "..."

import openai
import polars as pl

# Reuse the key from the environment instead of reading it from a local file.
openai.api_key = os.environ['OPENAI_API_KEY']

# Let's first define our prompt and model

We have designed the prompt to take in a question and a document and extract the relevant sections from it.
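
The placeholders in the template (`{question}`, `{document_title}`, `{document_text}`) are filled with Python's `str.format`. As a minimal sketch, here is how that works on a shortened, hypothetical template and a made-up row; `toy_template` and `toy_row` are not part of the dataset.

```python
# A shortened stand-in for the real template, with the same three placeholders.
toy_template = """
    You will be given a section of technical documentation titled {document_title}.

    The input is: '{question}?'.

    --- START: Document ---
    {document_text}
    --- END: Document ---
"""

# A made-up row. Extra keys (like document_link) are simply ignored by format().
toy_row = {
    "question": "How do I cache data",
    "document_title": "Caching",
    "document_text": "Toy document text with {braces} that are left untouched.",
    "document_link": "...",
}

filled = toy_template.format(**toy_row)
print(filled)

# A missing placeholder raises KeyError, so every row needs all three keys.
try:
    toy_template.format(question="only the question")
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
print("missing placeholder:", missing_key)
```

Braces inside the substituted values are safe because `format` does not re-parse them. Literal braces in the template itself would need to be doubled (`{{` and `}}`).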

In [2]:
prompt_template = """
    You are a developer assistant that can only quote text from documents. 
    You will be given a section of technical documentation titled {document_title}.
    
    The input is: '{question}?'. 

    Your task is to quote exactly all sections of the document that are relevant to any topics of the input. 
    Copy the text exactly as found in the original document. 
    
    Okay, here is the document:
    --- START: Document ---
    
    {document_text}

    --- END: Document ---
    Now do the task. If there are no relevant sections, just respond with \"<EMPTY MESSAGE>\".
    
    Here are the exact sections from the document:
"""


Let's now load our dataset and see what it looks like.

In [3]:
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join("datasets", "qna-notebook-data.jsonl")

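if not os.path.exists(dataset_path):
    import httpx
    # Create the target directory first; open() fails if it does not exist.
    os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
    r = httpx.get(url)
    r.raise_for_status()  # Stop early on a failed download
    with open(dataset_path, "wb") as f:
        f.write(r.content)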

dataset = pl.read_ndjson(dataset_path).select(pl.col(['question', 'document_title', 'document_text']))
print(dataset[0:2])

shape: (2, 3)
┌─────────────────────────────┬──────────────────────────────┬─────────────────────────────────────┐
│ question                    ┆ document_title               ┆ document_text                       │
│ ---                         ┆ ---                          ┆ ---                                 │
│ str                         ┆ str                          ┆ str                                 │
╞═════════════════════════════╪══════════════════════════════╪═════════════════════════════════════╡
│ How to use the sessionstate ┆ What is serializable session ┆ ## Serializable Session State       │
│ feat…                       ┆ sta…                         ┆                                     │
│                             ┆                              ┆ S…                                  │
│ How can I create histograms ┆ API reference                ┆ ader("Define a custom colorscale…   │
│ with…                       ┆                              ┆               


Let's now define our completion function, i.e. how we get a response from our LLM. We use GPT-3.5-Turbo for this.

In [4]:
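def get_model_response(input_dictn):
    # Fill the prompt template with the question, document title, and document text.
    prompt = [{"role": "system", "content": prompt_template.format(**input_dictn)}]
    # Note: openai.ChatCompletion is the pre-1.0 interface of the openai package.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=prompt,
        temperature=0.1
    )
    # Return only the text of the first completion.
    message = response.choices[0]['message']['content']
    return message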

With the setup complete, let's try out a few examples to see how this looks.

In [5]:
rows = dataset.to_dicts()
for idx in [0, 1, 4]:
    # Take the question and the model input from the same row.
    print({'input_question': rows[idx]['question'], 'llm_response': get_model_response(rows[idx])}, "\n")

{'input_question': 'How to use the sessionstate feature in Streamlit', 'llm_response': 'By default, Streamlit’s [Session State](https://docs.streamlit.io/library/advanced-features/session-state) allows you to persist any Python object for the duration of the session, irrespective of the object’s pickle-serializability.\n\nTo that end, Streamlit provides a `runner.enforceSerializableSessionState` [configuration option](https://docs.streamlit.io/library/advanced-features/configuration) that, when set to `true`, only allows pickle-serializable objects in Session State.'} 

{'input_question': 'How can I create histograms with different bucket colors in Streamlit', 'llm_response': '```\nader("Define a custom colorscale")\ndf = px.data.iris()\nfig = px.scatter(\n    df,\n    x="sepal_width",\n    y="sepal_length",\n    color="sepal_length",\n    color_continuous_scale="reds",\n)\n```\n```\nNotice how the custom color scale is still reflected in the chart, even when the Streamlit theme is ena

As the prompt allows, our model can return an empty response (`<EMPTY MESSAGE>`) in certain cases. Let's see how we can use the UpTrain validation framework to check for this and retry whenever it happens.

# Using Validation Framework to check for empty responses

Let's define a simple check for whether the model response is empty. We use the pre-built TextComparison operator for this. Running the check on our input data creates a new variable called 'is_empty_response'.

In [6]:
from uptrain.framework import SimpleCheck, Signal
from uptrain.operators import SelectOp
from uptrain.operators.language import TextComparison
from validation_wrapper import ValidationManager

check = SimpleCheck(
    name="empty_response_validation",
    sequence=[
        SelectOp(
            columns={
                "is_empty_response": TextComparison(
                    reference_text="<EMPTY MESSAGE>",
                    col_in_text="response",
                ),
            }
        )
    ],
)
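
To make the idea behind this check concrete, here is a plain-Python stand-in that flags responses equal to the empty marker. It is a sketch of the concept only, not UpTrain's implementation: the helper name `is_empty_response` and the whitespace stripping are assumptions made for this example.

```python
EMPTY_MARKER = "<EMPTY MESSAGE>"


def is_empty_response(response, reference_text=EMPTY_MARKER):
    """Return True when the response is exactly the empty marker.

    Stripping surrounding whitespace is an assumption of this sketch.
    """
    return response.strip() == reference_text


# Made-up responses, one empty and one not.
toy_rows = [
    {"response": "<EMPTY MESSAGE>"},
    {"response": "Session State allows you to persist any Python object."},
]

# Add the derived column to each row, mirroring the 'is_empty_response' variable.
for row in toy_rows:
    row["is_empty_response"] = is_empty_response(row["response"])

flags = [row["is_empty_response"] for row in toy_rows]
print(flags)
```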


Let's define our pass condition: the check passes whenever the response is not empty. UpTrain provides a wrapper called Signal that lets you define this pass condition using mathematical operators (like ~, &, |, +, etc.).

In [7]:
pass_condition = ~Signal('is_empty_response')
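
For intuition on how such operator-based conditions can work, here is a small sketch built on Python operator overloading. `ToySignal` is a hypothetical class written for this example and is not UpTrain's `Signal`; the column `is_long_enough` is also made up.

```python
class ToySignal:
    """Hypothetical stand-in for a named boolean column; not UpTrain's Signal."""

    def __init__(self, name=None, fn=None):
        # A leaf signal reads a column from the row; a derived one wraps a function.
        self._fn = fn if fn is not None else (lambda row: bool(row[name]))

    def evaluate(self, row):
        return self._fn(row)

    def __invert__(self):  # ~a
        return ToySignal(fn=lambda row: not self.evaluate(row))

    def __and__(self, other):  # a & b
        return ToySignal(fn=lambda row: self.evaluate(row) and other.evaluate(row))

    def __or__(self, other):  # a | b
        return ToySignal(fn=lambda row: self.evaluate(row) or other.evaluate(row))


# Pass when the response is not empty AND it is long enough.
toy_condition = ~ToySignal("is_empty_response") & ToySignal("is_long_enough")

passes = toy_condition.evaluate({"is_empty_response": False, "is_long_enough": True})
fails = toy_condition.evaluate({"is_empty_response": True, "is_long_enough": True})
print(passes, fails)
```

Nothing is evaluated when the condition is written. The operators only build a composed object, which is applied to each row later.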

UpTrain provides a validation manager class to which you pass your check, completion_function, and pass_condition. Now, instead of calling the completion_function directly, you call validation_manager.run with your inputs. Under the hood, it computes the check, evaluates the pass condition, and retries if the condition fails before outputting the LLM response.

In [8]:
validation_manager = ValidationManager(
    check=check,
    completion_fn=get_model_response,
    pass_condition=pass_condition
)
validation_manager.setup()
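
The validate-and-retry idea can be sketched independently of UpTrain. The function below is a hypothetical illustration, not the ValidationManager's actual code: the name `run_with_validation`, the `max_retries` cap, and the returned dictionary are all assumptions made for the example, and a canned fake completion function replaces the real LLM call.

```python
def run_with_validation(completion_fn, is_valid, inputs, max_retries=3):
    """Call completion_fn until is_valid(response) holds or retries run out."""
    response = None
    for attempt in range(1, max_retries + 1):
        response = completion_fn(inputs)
        if is_valid(response):
            return {"response": response, "attempts": attempt, "passed": True}
    # Every attempt failed: return the last response and flag the failure.
    return {"response": response, "attempts": max_retries, "passed": False}


# Fake completion function: two empty answers, then a useful one.
canned_responses = iter(["<EMPTY MESSAGE>", "<EMPTY MESSAGE>", "Quoted section."])


def fake_completion(inputs):
    return next(canned_responses)


def not_empty(response):
    return response != "<EMPTY MESSAGE>"


result = run_with_validation(fake_completion, not_empty, {"question": "toy question"})
print(result)
```

Capping the number of retries keeps a persistently failing input from looping forever and from consuming API calls without bound.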

Finally, let's run it on our input dataset.

In [9]:
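validated_responses = []
for inputs in dataset.to_dicts():
    # Keep every validated response instead of overwriting the previous one.
    validated_responses.append(validation_manager.run(inputs))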

2023-06-29 23:16:24.662 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-29 23:16:27.547 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-29 23:16:28.893 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-29 23:16:32.142 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_validation
2023-06-29 23:16:34.765 | DEBUG    | uptrain.framework.base:run:106 - Executing node: sequence_0 for operator DAG: empty_response_