# Applying Structured Output to RAG applications 

**What is RAG?**

Retrieval Augmented Generation (RAG) models are the bridge between large language models and external knowledge databases. They fetch the data relevant to a given query. For example, if you have some documents and want to ask questions about their content, a RAG model retrieves the relevant passages from those documents and passes them to the LLM alongside the query.

**How do RAG models work?**

The typical RAG process involves embedding a user query and searching a vector database to find the most relevant information to supplement the generated response. This approach works well when the database contains text that closely matches the query, but it struggles when answering requires more than a direct match.

![Image](https://jxnl.github.io/instructor/blog/img/dumb_rag.png)
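
The embed-then-search loop described above can be sketched in a few lines. This is only an illustration: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, the in-memory `index` stands in for a vector database, and the documents and helper names are made up for the example.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" (an assumption for illustration);
    # a real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical document store, embedded once up front.
documents = [
    "climate change is warming the oceans",
    "quantum computing uses qubits",
    "tips for travelers visiting europe",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> List[str]:
    # Embed the query and return the k most similar documents.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

top = retrieve("how do qubits work in quantum computing")
print(top)
```

The retrieved text would then be added to the prompt sent to the LLM. Note that the match depends entirely on how close the query is to the stored text, which is the weakness the next sections address.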

**Why is there a need for them?**

Pre-trained large language models do not learn over time. If you ask them a question they have not been trained on, they will often hallucinate. Therefore, we need to embed our own data to achieve a better output.

## Simple RAG

**What is it?**

The simplest implementation of RAG embeds a user query and performs a single embedding search in a vector database, such as a vector store of Wikipedia articles. However, this approach often falls short when dealing with complex queries and diverse data sources.

**What are the limitations?**

- **Query-Document Mismatch:** It assumes that the query and document embeddings will align in the vector space, which is often not the case.
    - Query: "Tell me about climate change effects on marine life."
    - Issue: The model might retrieve documents related to general climate change or marine life, missing the specific intersection of both topics.
- **Monolithic Search Backend:** It relies on a single search method and backend, reducing flexibility and the ability to handle multiple data sources.
    - Query: "Latest research in quantum computing."
    - Issue: The model might only search in a general science database, missing out on specialized quantum computing resources.
- **Text Search Limitations:** The model is restricted to simple text queries without the nuances of advanced search features.
    - Query: "what problems did we fix last week"
    - Issue: A simple text search cannot answer this, because documents containing "problem" and "last week" exist for every week.
- **Limited Planning Ability:** It fails to consider additional contextual information that could refine the search results.
    - Query: "Tips for first-time Europe travelers."
    - Issue: The model might provide general travel advice, ignoring the specific context of first-time travelers or European destinations.


## Improving the RAG model

**What's the solution?**

Enhancing RAG requires a more sophisticated approach known as query understanding.

This process involves analyzing the user's query and transforming it to better match the backend's search capabilities.

By doing so, we can significantly improve both the precision and recall of the search results, providing more accurate and relevant responses.

![Image](https://jxnl.github.io/instructor/blog/img/query_understanding.png)


## Practical Examples

In the examples below, we're going to use the [`instructor`](https://github.com/jxnl/instructor) library to simplify the interaction between the programmer and language models via the function-calling API.


In [16]:
import instructor 

from openai import OpenAI
from typing import List
from pydantic import BaseModel, Field

client = instructor.patch(OpenAI())

### Example 1) Improving Extractions

One big limitation is that the query we embed and the text we are searching often do not align closely, which leads to suboptimal results. A common use of structured output is to extract information from a document and use it to answer a question. Beyond that, we can be creative in how we extract, summarize, and generate potential questions so that our embeddings perform better.

For example, instead of using just a text chunk we could try to:

1. extract key words and themes
2. extract hypothetical questions
3. generate a summary of the text

In the example below, we use the `instructor` library to extract a summary, hypothetical questions, and keywords from a text chunk.

In [17]:
class Extraction(BaseModel):
    summary: str 
    hypothetical_questions: List[str] = Field(default_factory=list, description="Hypothetical questions that this document could answer")
    keywords: List[str] = Field(default_factory=list, description="Keywords that this document is about")

In [15]:
text_chunk = """
## Simple RAG

**What is it?**

The simplest implementation of RAG embeds a user query and do a single embedding search in a vector database, like a vector store of Wikipedia articles. However, this approach often falls short when dealing with complex queries and diverse data sources.

**What are the limitations?**

- **Query-Document Mismatch:** It assumes that the query and document embeddings will align in the vector space, which is often not the case.
    - Query: "Tell me about climate change effects on marine life."
    - Issue: The model might retrieve documents related to general climate change or marine life, missing the specific intersection of both topics.
- **Monolithic Search Backend:** It relies on a single search method and backend, reducing flexibility and the ability to handle multiple data sources.
    - Query: "Latest research in quantum computing."
    - Issue: The model might only search in a general science database, missing out on specialized quantum computing resources.
- **Text Search Limitations:** The model is restricted to simple text queries without the nuances of advanced search features.
    - Query: "what problems did we fix last week"
    - Issue: cannot be answered by a simple text search since documents that contain problem, last week are going to be present at every week.
- **Limited Planning Ability:** It fails to consider additional contextual information that could refine the search results.
    - Query: "Tips for first-time Europe travelers."
    - Issue: The model might provide general travel advice, ignoring the specific context of first-time travelers or European destinations.
"""

extraction = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Extraction,
    messages=[{
        "role": "system", 
        "content": "Your role is to extract data from the following text chunk and create a RAG document."
    }, {
        "role": "user", 
        "content": text_chunk
    }])


print(extraction.model_dump_json(indent=2))

{
  "summary": "Simple RAG, an approach to query embedding and document retrieval, often falls short due to misalignment between query and document embeddings, reliance on a single search method and backend, text search limitations, and inadequate consideration of contextual information.",
  "hypothetical_questions": [
    "How can RAG be improved to better handle complex queries?",
    "What other methods exist to enhance the alignment of queries and documents in vector spaces?",
    "How can multiple data sources be incorporated into the search mechanism of RAG?",
    "In what ways can advanced search features be integrated into RAG?",
    "What approaches enable RAG to use contextual information for improved search results?"
  ],
  "keywords": [
    "RAG",
    "query embedding",
    "document retrieval",
    "vector database",
    "query-document mismatch",
    "monolithic search backend",
    "text search limitations",
    "planning ability",
    "contextual information"
  ]
}


Now imagine embedding the summaries, hypothetical questions, and keywords in a vector database. You can then use vector search to find the best-matching document for a given query. What you'll find is that the results are often much better than if you had embedded only the text chunk!
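
One way to picture that indexing step is sketched below. It assumes each extracted field becomes its own searchable entry that points back to the source chunk. `IndexEntry`, `to_index_entries`, and the `chunk_id` value are hypothetical names introduced for illustration; the actual embedding and vector database calls are left out.

```python
from typing import List
from pydantic import BaseModel, Field

class Extraction(BaseModel):
    # Same shape as the Extraction model defined earlier.
    summary: str
    hypothetical_questions: List[str] = Field(default_factory=list)
    keywords: List[str] = Field(default_factory=list)

class IndexEntry(BaseModel):
    # Hypothetical record: one piece of text to embed, linked to its chunk.
    chunk_id: str
    kind: str
    text: str

def to_index_entries(chunk_id: str, extraction: Extraction) -> List[IndexEntry]:
    # One entry for the summary, one per hypothetical question,
    # and one for the keywords joined into a single string.
    entries = [IndexEntry(chunk_id=chunk_id, kind="summary", text=extraction.summary)]
    entries += [
        IndexEntry(chunk_id=chunk_id, kind="question", text=question)
        for question in extraction.hypothetical_questions
    ]
    if extraction.keywords:
        entries.append(
            IndexEntry(chunk_id=chunk_id, kind="keywords", text=", ".join(extraction.keywords))
        )
    return entries

example = Extraction(
    summary="Simple RAG often falls short on complex queries.",
    hypothetical_questions=[
        "How can RAG be improved to better handle complex queries?",
        "How can multiple data sources be incorporated into RAG?",
    ],
    keywords=["RAG", "vector database"],
)
entries = to_index_entries("chunk-1", example)
for entry in entries:
    print(entry.kind, "->", entry.text)
```

Each entry's `text` would be embedded separately, so a user question can match a hypothetical question directly while the `chunk_id` still leads back to the original text.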

### Example 2) Understanding 'recent queries' to add temporal context

One common application of structured outputs for query understanding is to identify the intent of a user's query. In this example, we use a simple schema to separately process the query and add temporal context.


In [25]:
from datetime import date

class DateRange(BaseModel):
    start: date
    end: date

class Query(BaseModel):
    rewritten_query: str
    published_daterange: DateRange

In this example, `DateRange` and `Query` are Pydantic models that structure the user's query with a rewritten query string and a date range to search within.

These models **restructure** the user's query by including a <u>rewritten query</u> and a <u>range of published dates</u>.

Using the new restructured query, we can apply this pattern to our function calls to obtain results that are optimized for our backend.

In [26]:
query = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Query,
    messages=[
        {
            "role": "system", 
            "content": f"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ..."
        },
        {
            "role": "user", 
            "content": "query: What are some recent developments in AI?"
        }
    ],
)

print(query.model_dump_json(indent=4)) # Printing the Json dump of the model

{
    "rewritten_query": "recent developments in Artificial Intelligence",
    "published_daterange": {
        "start": "2023-04-01",
        "end": "2023-11-18"
    }
}
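
To show how a backend might consume this structured query, here is a minimal sketch that applies `published_daterange` as a filter. The `Document` model, the sample documents, and `filter_by_date` are hypothetical; a real search backend would more likely translate the range into its own filter syntax.

```python
from datetime import date
from typing import List
from pydantic import BaseModel

class DateRange(BaseModel):
    start: date
    end: date

class Query(BaseModel):
    rewritten_query: str
    published_daterange: DateRange

class Document(BaseModel):
    # Hypothetical document record with a publication date.
    title: str
    published: date

def filter_by_date(docs: List[Document], query: Query) -> List[Document]:
    # Keep only documents published inside the requested range (inclusive).
    date_range = query.published_daterange
    return [doc for doc in docs if date_range.start <= doc.published <= date_range.end]

# The structured query printed above, parsed back into the model.
query = Query.model_validate({
    "rewritten_query": "recent developments in Artificial Intelligence",
    "published_daterange": {"start": "2023-04-01", "end": "2023-11-18"},
})

docs = [
    Document(title="Older AI survey", published=date(2021, 6, 1)),
    Document(title="New AI model released", published=date(2023, 9, 12)),
]
recent = filter_by_date(docs, query)
print([doc.title for doc in recent])
```

The text search can then run on `rewritten_query` over the filtered set, so "recent" is handled by the date filter rather than by keyword matching.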


This isn't just about adding some date ranges. We can even use some chain of thought prompting to generate tailored searches that are deeply integrated with our backend. 

In [23]:
class DateRange(BaseModel):
    chain_of_thought: str = Field(
        description="Think step by step to plan what is the best time range to search in"
    )
    start: date
    end: date

class Query(BaseModel):
    rewritten_query: str
    published_daterange: DateRange


query = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Query,
    messages=[
        {
            "role": "system", 
            "content": f"You're a query understanding system for a search engine. Today is {date.today()}."
        },
        {
            "role": "user", 
            "content": "What are some recent developments in AI?"
        }
    ],
)

print(query.model_dump_json(indent=4)) # Printing the Json dump of the model

{
    "rewritten_query": "recent developments in artificial intelligence",
    "published_daterange": {
        "chain_of_thought": "Given that it's currently late 2023, a recent timeframe would ideally be within the last year to ensure the developments are current. Therefore, a suitable date range for recent AI developments could start from late 2022 to the present date in 2023.",
        "start": "2022-11-18",
        "end": "2023-11-18"
    }
}


### Example 3) Personal Assistants, parallel processing

A personal assistant application needs to interpret vague queries and fetch information from multiple backends, such as emails and calendars. By modeling the assistant's capabilities using Pydantic, we can dispatch the query to the correct backend and retrieve a unified response.

For instance, when you ask, "What's on my schedule today?", the application needs to fetch data from various sources like events, emails, and reminders. This data is stored across different backends, but the goal is to provide a consolidated summary of results.

It's important to note that the data from these sources may not be embedded in a search backend. Instead, they could be accessed through different clients like a calendar or email, spanning both personal and professional accounts.


In [27]:
from typing import Literal

class SearchClient(BaseModel):
    query: str
    keywords: List[str]
    email: str
    source: Literal["gmail", "calendar"] 
    date_range: DateRange

class Retrival(BaseModel):
    queries: List[SearchClient]

Now we can use this with a straightforward query such as "What do I have today?".

The model decides which backend each query belongs to, and the application can then dispatch the queries asynchronously.

Keep in mind that prompting the language model effectively remains a key part of getting good results.


In [28]:
retrival = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Retrival,
    messages=[
        {"role": "system", "content":f"You are Jason's personal assistant. Today is {date.today()}"},
        {"role": "user", "content": "What do I have today?"}
    ],
)
print(retrival.model_dump_json(indent=4))

{
    "queries": [
        {
            "query": "schedule",
            "keywords": [
                "appointments",
                "meetings",
                "schedule",
                "events"
            ],
            "email": "jason.assistant@busybot.com",
            "source": "calendar",
            "date_range": {
                "start": "2023-11-18",
                "end": "2023-11-18"
            }
        }
    ]
}


To make it more challenging, we give it a request that contains multiple tasks and get back a list of queries routed to different search backends, such as email and calendar. Not only do we dispatch to different backends, over which we have no control, but we are also likely to render their results to the user in different ways.

In [29]:
retrival = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=Retrival,
    messages=[
        {"role": "system", "content": f"You are Jason's personal assistant. Today is {date.today()}"},
        {"role": "user", "content": "What meetings do I have today and are there any important emails I should be aware of?"}
    ],
)
print(retrival.model_dump_json(indent=4))

{
    "queries": [
        {
            "query": "meetings",
            "keywords": [
                "meetings",
                "appointments",
                "schedule",
                "calendar"
            ],
            "email": "user@email.com",
            "source": "calendar",
            "date_range": {
                "start": "2023-11-18",
                "end": "2023-11-18"
            }
        },
        {
            "query": "important emails",
            "keywords": [
                "important",
                "priority",
                "urgent",
                "follow-up"
            ],
            "email": "user@email.com",
            "source": "gmail",
            "date_range": {
                "start": "2023-11-18",
                "end": "2023-11-18"
            }
        }
    ]
}
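
The dispatch step itself is not shown in this section. A minimal sketch of it, assuming each backend is wrapped in an async function, might look like the following. `search_gmail`, `search_calendar`, and `BACKENDS` are hypothetical placeholders rather than real client APIs, and the queries are built by hand instead of coming from the model.

```python
import asyncio
from datetime import date
from typing import List, Literal
from pydantic import BaseModel

class DateRange(BaseModel):
    start: date
    end: date

class SearchClient(BaseModel):
    query: str
    keywords: List[str]
    email: str
    source: Literal["gmail", "calendar"]
    date_range: DateRange

async def search_gmail(q: SearchClient) -> str:
    # Placeholder for a real email search call.
    await asyncio.sleep(0)
    return f"gmail results for '{q.query}'"

async def search_calendar(q: SearchClient) -> str:
    # Placeholder for a real calendar lookup.
    await asyncio.sleep(0)
    return f"calendar results for '{q.query}'"

# Route each query to a backend based on its `source` field.
BACKENDS = {"gmail": search_gmail, "calendar": search_calendar}

async def dispatch(queries: List[SearchClient]) -> List[str]:
    # Run all backend calls concurrently; results keep the input order.
    return list(await asyncio.gather(*(BACKENDS[q.source](q) for q in queries)))

today = DateRange(start=date(2023, 11, 18), end=date(2023, 11, 18))
queries = [
    SearchClient(query="meetings", keywords=["meetings"], email="user@email.com",
                 source="calendar", date_range=today),
    SearchClient(query="important emails", keywords=["urgent"], email="user@email.com",
                 source="gmail", date_range=today),
]
results = asyncio.run(dispatch(queries))
print(results)
```

With the model output above, the list passed to `dispatch` would be `retrival.queries`. Because `source` is a `Literal`, every query the model returns maps to a known key in `BACKENDS`.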


### Example 4) Decomposing questions 

Lastly, a slightly more complex problem that can be solved with structured output is decomposing questions. The goal is to break a question into a series of sub-questions that a search backend can answer. For example:

"What's the difference in population between Jason's home country and Canada?"

You'd ultimately need to know a few things:

1. Jason's home country
2. The population of Jason's home country
3. The population of Canada
4. The difference between the two

This cannot be answered correctly with a single query, nor can every step run in parallel. However, there are some opportunities for parallelism, since not all of the sub-questions depend on each other.

In [31]:
class Question(BaseModel):
    id: int = Field(..., description="A unique identifier for the question")
    query: str = Field(..., description="The question decomposed as much as possible")
    subquestions: List[int] = Field(default_factory=list, description="The subquestions that this question is composed of")


class QueryPlan(BaseModel):
    root_question: str = Field(..., description="The root question that the user asked")
    plan: List[Question] = Field(..., description="The plan to answer the root question and its subquestions")


query_plan = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_model=QueryPlan,
    messages=[
        {"role": "system", "content": "You are a query understanding system capable of decomposing a question into subquestions."},
        {"role": "user", "content": "What is the difference between the population of jason's home country and canada?"}
    ],
)

print(query_plan.model_dump_json(indent=4))  # Print the plan as JSON

{
    "root_question": "What is the difference between the population of jason's home country and canada?",
    "plan": [
        {
            "id": 1,
            "query": "What is Jason's home country?",
            "subquestions": []
        },
        {
            "id": 2,
            "query": "What is the population of Canada?",
            "subquestions": []
        },
        {
            "id": 3,
            "query": "What is the population of {Jason's home country}?",
            "subquestions": [
                1
            ]
        },
        {
            "id": 4,
            "query": "What is the difference between the population of {Jason's home country} and the population of Canada?",
            "subquestions": [
                2,
                3
            ]
        }
    ]
}
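
To make the parallelism concrete, the plan can be grouped into levels, where every question in a level depends only on questions from earlier levels. The sketch below is one possible approach, not part of `instructor`: `execution_levels` is a hypothetical helper, and the plan is rebuilt by hand from the output above.

```python
from typing import List, Set
from pydantic import BaseModel, Field

class Question(BaseModel):
    id: int
    query: str
    subquestions: List[int] = Field(default_factory=list)

def execution_levels(plan: List[Question]) -> List[List[int]]:
    # Group question ids into levels; all ids in one level can run in parallel
    # because their dependencies were answered in earlier levels.
    remaining = {q.id: set(q.subquestions) for q in plan}
    done: Set[int] = set()
    levels: List[List[int]] = []
    while remaining:
        ready = sorted(qid for qid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("plan has a cycle or refers to an unknown question id")
        levels.append(ready)
        done.update(ready)
        for qid in ready:
            del remaining[qid]
    return levels

plan = [
    Question(id=1, query="What is Jason's home country?"),
    Question(id=2, query="What is the population of Canada?"),
    Question(id=3, query="What is the population of {Jason's home country}?", subquestions=[1]),
    Question(id=4, query="What is the difference between the two populations?", subquestions=[2, 3]),
]
levels = execution_levels(plan)
print(levels)
```

Here questions 1 and 2 can be answered at the same time, question 3 waits for question 1, and question 4 runs last.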


I hope this section has shown you some ways to be creative in modeling structured outputs, so that you can leverage LLMs to build lightweight components for your systems.