<a href="https://colab.research.google.com/github/wandb/edu/blob/main/llm-structured-extraction/3.1.validation-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{llmeng-1-nb4} -->

# Understanding Validators and controlling responses


Previously we went over how to use structured extraction to query and plan a search request.

In this section we'll aim to:

1. Expand on how Pydantic's validation features work
2. Apply them to generate better responses by using feedback and validation.

Pydantic offers a customizable and expressive validation framework for Python. Instructor leverages Pydantic's validation framework to provide a uniform developer experience for both code-based and LLM-based validation, as well as a reasking mechanism for correcting LLM outputs based on validation errors. To learn more check out the Pydantic [docs](https://docs.pydantic.dev/latest/) on validators.


Before we get started, let's go over the general shape of a validator. A validator is a function that either raises an error when a value is invalid or returns the (possibly modified) value:

```python
def validation_function(value):
    # `condition` and `mutation` are placeholders for your own logic
    if condition(value):
        raise ValueError("Value is not valid")
    return mutation(value)
```
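
As a concrete instance of that shape, here is a minimal sketch in plain Python with no Pydantic involved. The function name `clean_username` and its rules are hypothetical, chosen only for illustration: it rejects empty input and otherwise returns a normalized value.

```python
def clean_username(value: str) -> str:
    """Validate, then mutate: the two halves of a validator."""
    stripped = value.strip()
    # Check: reject values that fail the condition.
    if not stripped:
        raise ValueError("Username must not be empty")
    # Mutation: return the normalized value that callers should use.
    return stripped.lower()


print(clean_username("  Jason "))  # jason

try:
    clean_username("   ")
except ValueError as e:
    print(e)  # Username must not be empty
```

The same function could later be attached to a Pydantic field, since it follows the "raise or return" contract.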


# Setup Colab

Run this code if you're using Google Colab; you can skip it if you're running locally. You may need to restart Colab after installing the requirements.

In [1]:
from pathlib import Path

# Download files on colab
if not Path("requirements.txt").exists():
    !wget https://raw.githubusercontent.com/wandb/edu/main/llm-structured-extraction/{requirements.txt,helpers.py}
    !pip install -r requirements.txt -Uqq

In [2]:
import os
from getpass import getpass
import openai

# Set up your OpenAI API key
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")
  openai.api_key = os.getenv("OPENAI_API_KEY", "")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

Please enter password in the VS Code prompt at the top of your VS Code window!
OpenAI API key configured


## Using Weave for LLM Experiment Tracking

[Weave](https://wandb.github.io/weave/) is a lightweight toolkit by Weights & Biases for tracking and evaluating LLM applications. It allows you to:

- Log and debug language model inputs, outputs, and traces
- Build rigorous evaluations for LLM use cases
- Organize information across the LLM workflow

OpenAI calls are automatically logged to Weave.
`@weave.op()` allows you to log additional information to Weave.

In [3]:
import weave
weave.init("llmeng-1-nb4")

Logged in as Weights & Biases user: a-sh0ts.
View Weave data at https://wandb.ai/a-sh0ts/llmeng-1-nb4/weave


<weave.weave_client.WeaveClient at 0x147294590>

## Defining Validator Functions


In [4]:
from typing_extensions import Annotated
from pydantic import BaseModel, AfterValidator, WithJsonSchema


def name_must_contain_space(v: str) -> str:
    if " " not in v:
        raise ValueError("Name must contain a space.")
    return v

def uppercase_name(v: str) -> str:
    return v.upper()

FullName = Annotated[
    str, 
    AfterValidator(name_must_contain_space), 
    AfterValidator(uppercase_name),
    WithJsonSchema(
        {
            "type": "string",
            "description": "The user's full name",
        }
    )]

class UserDetail(BaseModel):
    age: int
    name: FullName

In [5]:
UserDetail(age=30, name="Jason Liu")

UserDetail(age=30, name='JASON LIU')

In [6]:
UserDetail.model_json_schema()

{'properties': {'age': {'title': 'Age', 'type': 'integer'},
  'name': {'description': "The user's full name",
   'title': 'Name',
   'type': 'string'}},
 'required': ['age', 'name'],
 'title': 'UserDetail',
 'type': 'object'}

In [7]:
try:
    person = UserDetail.model_validate({"age": 24, "name": "Jason"})
except Exception as e:
    print(e)

1 validation error for UserDetail
name
  Value error, Name must contain a space. [type=value_error, input_value='Jason', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error


## Using Field

We can also use the `Field` class to define validators. This is useful for primitive fields, such as strings or integers, where one of a small set of built-in constraints (for example `gt=0`) is enough.


In [8]:
from pydantic import Field


Age = Annotated[int, Field(gt=0)]

class UserDetail(BaseModel):
    age: Age
    name: FullName

try:
    person = UserDetail(age=-10, name="Jason")
except Exception as e:
    print(e)

2 validation errors for UserDetail
age
  Input should be greater than 0 [type=greater_than, input_value=-10, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/greater_than
name
  Value error, Name must contain a space. [type=value_error, input_value='Jason', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
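
`Field` accepts other built-in constraints besides `gt`. The sketch below assumes Pydantic v2 and uses hypothetical `Username`, `Score`, and `Player` names to show `min_length`, `max_length`, `ge`, and `le`. It also reads the structured error list instead of printing the whole exception.

```python
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

# Hypothetical constrained types, for illustration only.
Username = Annotated[str, Field(min_length=3, max_length=20)]
Score = Annotated[int, Field(ge=0, le=100)]


class Player(BaseModel):
    username: Username
    score: Score


try:
    Player(username="jl", score=101)
except ValidationError as e:
    # errors() returns one dict per failure; "loc" holds the path to the field.
    failed = [err["loc"][0] for err in e.errors()]
    print(failed)
```

Both fields are reported in a single `ValidationError`, which matters later: all failures can be fed back to the model at once.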


## Providing Context


In [9]:
from pydantic import ValidationInfo

def message_cannot_have_blacklisted_words(v: str, info: ValidationInfo) -> str:
    # info.context is None when no context is passed, so fall back to an empty dict
    blacklist = (info.context or {}).get("blacklist", [])
    for word in blacklist:
        assert word not in v.lower(), f"`{word}` was found in the message `{v}`"
    return v

ModeratedStr = Annotated[str, AfterValidator(message_cannot_have_blacklisted_words)]

class Response(BaseModel):
    message: ModeratedStr


try:
    Response.model_validate(
        {"message": "I will hurt them."},
        context={
            "blacklist": {
                "rob",
                "steal",
                "kill",
                "attack",
            }
        },
    )
except Exception as e:
    print(e)
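
The call above prints nothing because none of the listed words occur in the message. To see the validator reject a value, here is a self-contained sketch under the same idea. The names `no_blocked_words` and `ModeratedResponse` are hypothetical, and it raises `ValueError` rather than using `assert` so the check still runs when Python is started with `-O`.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, ValidationError, ValidationInfo


def no_blocked_words(v: str, info: ValidationInfo) -> str:
    # info.context is None when no context is passed, so fall back to {}.
    blocked = (info.context or {}).get("blacklist", [])
    for word in blocked:
        if word in v.lower():
            raise ValueError(f"`{word}` was found in the message `{v}`")
    return v


class ModeratedResponse(BaseModel):
    message: Annotated[str, AfterValidator(no_blocked_words)]


context = {"blacklist": ["steal", "attack"]}

# Passes: no blocked word appears in the message.
ok = ModeratedResponse.model_validate({"message": "I will help them."}, context=context)

# Fails: "steal" is in the message, so the validator raises.
try:
    ModeratedResponse.model_validate({"message": "I will steal it."}, context=context)
    error_message = None
except ValidationError as e:
    error_message = str(e)
    print(error_message)

# Without a context the blocklist is empty, so nothing is rejected.
unchecked = ModeratedResponse(message="I will steal it.")
```

Note that this is plain substring matching, so a short entry such as `rob` would also match `problem`; a word-boundary check would be stricter.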

## Using OpenAI Moderation


To enhance our validation measures, we'll extend the scope to flag any answer that contains hateful content, harassment, or similar issues. OpenAI offers a moderation endpoint that addresses these concerns, and it's freely available when using OpenAI models.


With the `instructor` library, this is just one function edit away:


In [10]:
from typing import Annotated
from pydantic import AfterValidator
from instructor import openai_moderation

import instructor
from openai import OpenAI

client = instructor.patch(OpenAI())

# Annotated (available from `typing` since Python 3.9) attaches
# custom metadata, such as validators, to a type hint.
ModeratedStr = Annotated[str, AfterValidator(openai_moderation(client=client))]


class Response(BaseModel):
    message: ModeratedStr


try:
    Response(message="I want to make them suffer the consequences")
except Exception as e:
    print(e)

1 validation error for Response
message
  Value error, `I want to make them suffer the consequences` was flagged for violence [type=value_error, input_value='I want to make them suffer the consequences', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error


## General Validator


In [12]:
from instructor import llm_validator

HealthTopicStr = Annotated[
    str,
    AfterValidator(
        llm_validator(
            "don't talk about any other topic except health best practices and topics",
            client=client,
        )
    ),
]


class AssistantMessage(BaseModel):
    message: HealthTopicStr


AssistantMessage(
    message="I would suggest you to visit Sicily as they say it is very nice in winter."
)

🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/4357cfa3-17bf-415f-a966-a3fa0ccf030e


ValidationError: 1 validation error for AssistantMessage
message
  Assertion failed, The statement does not follow the rule of only discussing health best practices and topics. [type=assertion_error, input_value='I would suggest you to v...is very nice in winter.', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/assertion_error

## Avoiding hallucination with citations


When incorporating external knowledge bases, it's crucial to ensure that the agent uses the provided context accurately and doesn't fabricate responses. Validators can be effectively used for this purpose. We can illustrate this with an example where we validate that a provided citation is actually included in the referenced text chunk:

In [13]:
from pydantic import ValidationInfo

def citation_exists(v: str, info: ValidationInfo) -> str:
    # Only check when a text chunk was supplied through the validation context
    text_chunk = (info.context or {}).get("text_chunk")
    if text_chunk is not None and v not in text_chunk:
        raise ValueError(f"Citation `{v}` not found in text, only use citations from the text.")
    return v

Citation = Annotated[
    str,
    AfterValidator(citation_exists),
    WithJsonSchema({
        "type": "string",
        "description": "For every answer provide an exact substring match to the context"
    })
]


class AnswerWithCitation(BaseModel):
    answer: str
    citation: Citation

try:
    AnswerWithCitation.model_validate(
        {
            "answer": "Jason is cool",
            "citation": "Jason is a cool person",
        },
        context={"text_chunk": "Jason is just a normal guy"},
    )
except Exception as e:
    print(e)

1 validation error for AnswerWithCitation
citation
  Value error, Citation `Jason is a cool person` not found in text, only use citations from the text. [type=value_error, input_value='Jason is a cool person', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error


Here we assume the validation context contains a `text_chunk` key holding the text that the model is supposed to use. The `citation_exists` function, attached to the `Citation` type with `AfterValidator`, checks whether the citation appears verbatim in that text chunk. If it doesn't, we raise a `ValueError` whose message is reported in the validation error.


If we want to pass in the context through the `chat.completions.create` endpoint, we can use the `validation_context` parameter:

```python
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=AnswerWithCitation,
    messages=[
        {"role": "user", "content": f"Answer the question `{q}` using the text chunk\n`{text_chunk}`"},
    ],
    validation_context={"text_chunk": text_chunk},
)
```

In practice there are many ways to implement this: we could use a regex to check if the citation is included in the text chunk, or we could use a more sophisticated approach like a semantic similarity check. The important thing is that we have a way to validate that the model is using the provided context accurately.
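
As one illustration of the regex route, the sketch below relaxes the exact-substring check so that differences in case and whitespace are ignored. The helper names `normalize` and `citation_in_text` are hypothetical, and only the standard library is used.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_in_text(citation: str, text_chunk: str) -> bool:
    """Return True if the citation appears in the chunk, ignoring case and spacing."""
    needle = normalize(citation)
    # An empty citation is a substring of everything, so reject it explicitly.
    return bool(needle) and needle in normalize(text_chunk)


chunk = "Jason is just\na normal   guy"
print(citation_in_text("jason is just a normal guy", chunk))  # True
print(citation_in_text("Jason is a cool person", chunk))      # False
```

A function like this could replace the `v not in text_chunk` test inside a validator when line breaks in the source text would otherwise cause spurious failures.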


## Reasking with validators

In most of these examples we've only defined the validation logic, which can be kept separate from generation. However, when we get validation errors, we shouldn't stop there! Instead, instructor lets us collect all the validation errors and reask the LLM to rewrite its answer.
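
Conceptually, reasking is a loop: generate, validate, and on failure feed the error back into the next attempt. The sketch below illustrates that idea in plain Python with a stand-in `fake_llm` function. It is not instructor's actual implementation, and every name in it (`validate_answer`, `fake_llm`, `ask_with_retries`, `max_attempts`) is hypothetical.

```python
from typing import Callable, List


def validate_answer(answer: str) -> str:
    # Stand-in validator: reject answers that mention "sin".
    if "sin" in answer.lower():
        raise ValueError("Answer must not mention sin")
    return answer


def fake_llm(messages: List[str]) -> str:
    # Stand-in for a model call: it "corrects" itself once it sees an error message.
    if any(m.startswith("Validation error:") for m in messages):
        return "The meaning of life is to be kind."
    return "The meaning of life is a life of sin."


def ask_with_retries(
    prompt: str, llm: Callable[[List[str]], str], max_attempts: int = 2
) -> str:
    messages = [prompt]
    last_error = None
    for _ in range(max_attempts):
        answer = llm(messages)
        try:
            return validate_answer(answer)
        except ValueError as e:
            last_error = e
            # Feed the failed answer and the error back so the next attempt can fix it.
            messages += [answer, f"Validation error: {e}. Please rewrite your answer."]
    raise RuntimeError(f"No valid answer after {max_attempts} attempts: {last_error}")


result = ask_with_retries("What is the meaning of life?", fake_llm)
print(result)  # The meaning of life is to be kind.
```

If every attempt fails validation, the loop gives up and raises, which is why a strict validator combined with a small retry budget can end in an exception rather than an answer.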

Let's use an extreme example to illustrate this point:


In [14]:
class QuestionAnswer(BaseModel):
    question: str
    answer: str


question = "What is the meaning of life?"
context = (
    "According to the devil the meaning of life is a life of sin and debauchery."
)


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=QuestionAnswer,
    messages=[
        {
            "role": "system",
            "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
        },
        {
            "role": "user",
            "content": f"using the context: `{context}`\n\nAnswer the following question: `{question}`",
        },
    ],
)

print(resp.model_dump_json(indent=2))

🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/8781ce01-6f57-4f86-9527-8e773ea761d2
{
  "question": "What is the meaning of life?",
  "answer": "According to the devil the meaning of life is a life of sin and debauchery."
}


In [15]:
from instructor import llm_validator


NotEvilAnswer = Annotated[
    str,
    AfterValidator(
        llm_validator("don't say objectionable things", client=client)
    ),
]


class QuestionAnswer(BaseModel):
    question: str
    answer: NotEvilAnswer


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=QuestionAnswer,
    max_retries=2,
    messages=[
        {
            "role": "system",
            "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
        },
        {
            "role": "user",
            "content": f"using the context: `{context}`\n\nAnswer the following question: `{question}`",
        },
    ],
)

🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/668888ae-f117-432d-b7b8-d77a637d2f28
🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/e3c64170-9843-4c90-ab7b-33051e007596
🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/4d461b21-ff12-4bde-953e-000fb196ed17
🍩 https://wandb.ai/a-sh0ts/llmeng-1-nb4/r/call/bdc5ca0c-5852-4054-a846-053a96dd8251


InstructorRetryException: RetryError[<Future at 0x14773deb0 state=finished raised ValidationError>]

In [None]:
print(resp.model_dump_json(indent=2))