Let's start with the indexing stage:

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector

# Load the document and split it into chunks
raw_documents = TextLoader('../data/test.txt').load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                               chunk_overlap=200)
documents = text_splitter.split_documents(raw_documents)

# Embed each chunk and insert it into the vector store
model = OllamaEmbeddings(model="mxbai-embed-large")
connection = 'postgresql+psycopg://langchain:langchain@localhost:6024/langchain'
db = PGVector.from_documents(documents, model, connection=connection)
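
To make *chunk_size* and *chunk_overlap* more concrete, here is a minimal sketch of a plain sliding-window splitter. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this toy version ignores. The function name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# each chunk repeats the last 2 characters of the previous one
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the neighbouring chunks, at the cost of storing some text twice.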

The indexing stage is complete. For the retrieval stage, we run a similarity search between the query and the stored documents, so that the relevant chunks of our indexed document are retrieved. The retrieval process works as follows:

1. Convert the user's query into embeddings.
2. Find the embeddings in the vector store that are most similar to the query embedding.
3. Retrieve the relevant document embeddings and their corresponding text chunk.
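
The three steps above can be sketched in plain Python. This is only an illustration: the three-dimensional vectors are made up, cosine similarity is assumed as the metric (the actual metric depends on how the vector store is configured), and `top_k` is a hypothetical helper, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the text of the k stored chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# made-up (text, embedding) pairs standing in for the vector store
toy_store = [
    ("chunk about philosophy", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about ancient Greece", [0.7, 0.6, 0.1]),
]
toy_query = [1.0, 0.3, 0.0]  # step 1: the query, already embedded

# steps 2 and 3: rank by similarity and return the matching text chunks
print(top_k(toy_query, toy_store, k=2))
```

A real vector store does the same ranking, usually with an index so it does not have to compare the query against every stored vector.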

In [3]:
# create retriever
retriever = db.as_retriever()

# fetch relevant documents
docs = retriever.invoke("""Who are the key figures in the ancient greek history of philosophy?""")

Here we are using the *.as_retriever* method, which abstracts the logic of embedding the user's query and the underlying similarity search performed by the vector store to retrieve the relevant documents. There's also an argument *k* that determines the number of relevant documents to fetch from the vector store. For example:

In [4]:
# create retriever with k=2
retriever = db.as_retriever(search_kwargs={"k":2})

# fetch the 2 most relevant documents
docs = retriever.invoke("""Who are the key figures in the ancient greek history 
    of philosophy?""")

Notice we have selected 2 as the value of **k**. This tells the vector store to return the two most relevant documents for the user's query. Retrieving more documents can slow down the application and makes the prompt larger (which raises the cost of generation). It also increases the likelihood of retrieving chunks of text that contain irrelevant information, which could lead to LLM hallucinations.
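
As a rough back-of-the-envelope check, the amount of context added to the prompt grows linearly with **k**. The sketch below assumes every chunk is at most `chunk_size` characters long (1000 in the indexing cell); `estimate_context_chars` is a hypothetical helper used only for this estimate.

```python
def estimate_context_chars(k, chunk_size=1000):
    """Upper bound on the characters of retrieved context added to the prompt."""
    return k * chunk_size


for k in (2, 4, 10):
    print(f"k={k}: up to {estimate_context_chars(k)} characters of context")
```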

## Generating LLM Predictions Using Relevant Documents
Once we have the relevant documents for the user's query, the final step is to add them to the original prompt as context and then invoke the model to generate a final output, as shown in the following image:

![LLM flow with RAG](../img/LLMflow.png)

The following code will serve as an example:

In [5]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

retriever = db.as_retriever()

prompt = ChatPromptTemplate.from_template("""Answer the question based only on
                                          the following context:
                                          {context}
                                          
                                          Question: {question}
                                          """)

llm = ChatOllama(model="llama3.1", temperature=0)

chain = prompt | llm

# fetch relevant documents
# (retriever.get_relevant_documents is deprecated; use invoke instead)
docs = retriever.invoke("""What is the eternal voyage about?""")

# run
chain.invoke({"context": docs, "question": """What is the eternal voyage about?"""})

AIMessage(content='The Eternal Voyage appears to be a poetic and metaphorical journey through time, exploring themes of fate, memory, loss, and the passage of time. The poem describes a vessel sailing across "endless seas of time," with a captain who navigates through "mist-clad isles of memory" and charts a course through "fate unknown." The poem also touches on the idea of empires rising and falling, and the fleeting nature of human endeavors.\n\nThe overall tone of the poem suggests that it is about the human experience of navigating through time, confronting the impermanence of things, and seeking to understand what lingers beyond death. The poem\'s use of imagery and symbolism creates a sense of mystery and wonder, inviting the reader to reflect on their own place in the passage of time.\n\nWithout more context or information about the author\'s intentions, it is difficult to provide a definitive interpretation of the Eternal Voyage. However, based on the language and themes used 

A brief explanation of the code above:
- We added dynamic *context* and *question* variables to our prompt, which lets us define a *ChatPromptTemplate* the model can use to generate responses.
- We define a *ChatOllama* interface to act as our LLM. Temperature is set to 0 to minimize the randomness (creativity) of the model's outputs.
- We create a chain to compose the prompt and LLM.
- We *invoke* the chain, passing in the *context* variable (our retrieved relevant docs) and the user's question to generate a final output.
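
In the cell above the list of documents is passed to the prompt as is, so the context is presumably rendered from the string form of each document, metadata included. A common alternative is to join only the chunk text before filling the prompt. The sketch below assumes objects with a `page_content` attribute; `FakeDocument` and `format_docs` are hypothetical names used so the snippet runs without LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same page_content attribute."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each chunk, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)


sample_docs = [
    FakeDocument("First chunk.", {"source": "test.txt"}),
    FakeDocument("Second chunk."),
]
print(format_docs(sample_docs))
```

The resulting string could then be passed as the *context* value instead of the raw list, which keeps metadata out of the prompt.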

We can encapsulate this retrieval logic in a single function:

In [6]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain

retriever = db.as_retriever()

prompt = ChatPromptTemplate.from_template("""Answer the question based only on
                                            the following context:
                                          {context}

                                          Question: {question}
                                          """)

llm = ChatOllama(model="llama3.1", temperature=0)

@chain
def qa(input: str):
    # fetch the relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer.content

#run
qa.invoke("What is the eternal voyage about?")

'The Eternal Voyage appears to be a poetic and metaphorical journey through time, exploring themes of fate, memory, loss, and the passage of time. The poem describes a vessel sailing across "endless seas of time," with a captain who navigates through "mist-clad isles of memory" and charts a course through "fate unknown." The poem also touches on the idea of empires rising and falling, and the fleeting nature of human endeavors.\n\nThe overall tone of the poem suggests that it is about the human experience of navigating through time, confronting the impermanence of things, and seeking to understand what lingers beyond death. The poem\'s use of imagery and symbolism creates a sense of mystery and wonder, inviting the reader to reflect on their own place in the passage of time.\n\nThe repetition of similar lines across different documents suggests that this is a central theme or idea being explored, rather than a specific narrative or plot.'

Now we have a runnable **qa** function that can be called with just a question. It first fetches the relevant docs for context, formats them into the prompt, and finally generates the answer. The *@chain* decorator turns the function into a runnable chain. This notion of encapsulating multiple steps into a single function is key to building interesting apps with LLMs. The function can also return the retrieved documents alongside the answer, which is useful for inspecting the sources:

In [7]:
@chain
def qa(input: str):
    # fetch relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return {"answer": answer, "docs": docs}

#run
qa.invoke("What is the eternal voyage about?")

{'answer': AIMessage(content='The Eternal Voyage appears to be a poetic and metaphorical journey through time, exploring themes of fate, memory, loss, and the passage of time. The poem describes a vessel sailing across "endless seas of time," with a captain who navigates through "mist-clad isles of memory" and charts a course through "fate unknown." The poem also touches on the idea of empires rising and falling, and the fleeting nature of human endeavors.\n\nThe overall tone of the poem suggests that it is about the human experience of navigating through time, confronting the impermanence of things, and seeking to understand what lingers beyond death. The poem\'s use of imagery and symbolism creates a sense of mystery and wonder, inviting the reader to reflect on their own place in the passage of time.\n\nThe repetition of similar lines across different documents suggests that this is a central theme or idea being explored, rather than a specific narrative or plot.', additional_kwargs

This is a basic RAG system, enough to power an AI app for personal use. However, a production-ready AI app used by multiple users requires a more advanced RAG system. To build one, we need to answer the following questions:

1. How do we handle the variability in the quality of a user's input?
2. How do we route queries to retrieve relevant data from a variety of data sources?
3. How do we transform natural language to the query language of the target data source?
4. How do we optimize our indexing process, i.e., embedding, text splitting?

Let's dive into one of the relevant topics regarding advanced RAG systems:

## Query Transformation
One of the main issues with RAG systems is that they rely heavily on the quality of the user's input. In a production environment, it is likely that a user's input will be incomplete, ambiguous, or poorly worded, which can lead to model hallucination.

Query transformation is a subset of strategies designed to modify the user's input to answer the first question: *How do we handle the variability in the quality of a user's input?*

One strategy could be as follows:

### Rewrite-Retrieve-Read
This strategy prompts an LLM to rewrite the user's input before performing the retrieval. First, let's see what happens with a poorly written prompt.

In [8]:
@chain
def qa(input):
    # fetch relevant documents
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

qa.invoke("""Today I woke up and brushed my teeth, then I sat down to read the 
    news. But then I forgot the food on the cooker. What is the eternal voyage about?""")

AIMessage(content='The context provided doesn\'t mention anything about your daily activities or the news. It appears to be a collection of documents with poetic content.\n\nAs for "The Eternal Voyage," it seems to be a poem that describes a journey through time, space, and memory. The poem uses metaphors and imagery to convey themes of exploration, loss, and the passage of time.', additional_kwargs={}, response_metadata={'model': 'llama3.1', 'created_at': '2025-03-29T18:50:23.9509807Z', 'done': True, 'done_reason': 'stop', 'total_duration': 37413659200, 'load_duration': 8013306500, 'prompt_eval_count': 1181, 'prompt_eval_duration': 14966000000, 'eval_count': 74, 'eval_duration': 14433000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-06a5c26a-67b9-4e98-aabf-90c53c706819-0', usage_metadata={'input_tokens': 1181, 'output_tokens': 74, 'total_tokens': 1255})

The model got distracted by the irrelevant information in the user's query and spent part of its answer addressing it. Let's implement the Rewrite-Retrieve-Read prompt.

In [9]:
rewrite_prompt = ChatPromptTemplate.from_template("""Provide a better search 
    query for web search engine to answer the given question, end the queries 
    with '**'. Question: {x} Answer:""")

def parse_rewriter_output(message):
    return message.content.strip('"').strip("**")

rewriter = rewrite_prompt | llm | parse_rewriter_output

@chain
def qa_rrr(input):
    # rewrite the query
    new_query = rewriter.invoke(input)
    # fetch relevant documents
    docs = retriever.invoke(new_query)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

# run
qa_rrr.invoke("""Today I woke up and brushed my teeth, then I sat down to read the 
    news. But then I forgot the food on the cooker. Nevermind, What is the eternal voyage poem about?""")

AIMessage(content='The Eternal Voyage poem appears to be a metaphorical and poetic piece that explores themes of time, memory, fate, and the human experience. The poem describes a vessel sailing through "endless seas of time," where echoes fade and stars align.\n\nBased on the content, it seems that the poem is about a journey through the past, present, and future, where the speaker navigates through memories, lost moments, and forgotten eras. The poem touches on ideas of impermanence, the passage of time, and the human condition.\n\nThe poem\'s structure and language suggest a sense of longing, nostalgia, and wonder. The speaker seems to be drawn to the idea of exploring the unknown, charting a course through fate, and learning from the past.\n\nSome possible interpretations of the poem include:\n\n* A reflection on the fleeting nature of time and human experience\n* An exploration of the power of memory and its relationship to our understanding of ourselves and the world around us\n*

Notice that we had an LLM rewrite the user's initial distracted query into a much clearer one, and it is that more focused query that is passed to the retriever to fetch the most relevant documents.
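
The rewriter's raw output is not shown above, so the sample strings below are made up. Under that assumption, here is a minimal sketch of a slightly more defensive parser that keeps the first non-empty piece of text around the `**` terminator, in case the model appends an explanation after it. `extract_query` is a hypothetical helper, not part of LangChain.

```python
def extract_query(text, terminator="**"):
    """Return the first non-empty segment of text split on the terminator."""
    for part in text.split(terminator):
        # drop surrounding whitespace and any wrapping double quotes
        cleaned = part.strip().strip('"').strip()
        if cleaned:
            return cleaned
    return ""


# made-up examples of what a rewriter might return
print(extract_query('"eternal voyage poem meaning"**'))
print(extract_query("eternal voyage poem summary** This query focuses on the poem."))
```

To try it in the chain, the function would need to receive `message.content` rather than the message object itself.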

> **Note:** This technique can be used with any retrieval method, be that a vector store such as we have here or, for instance, a web search tool. The downside of this approach is that it introduces additional latency into our chain, because now we need to perform 2 LLM calls in sequence.

### Multi-Query Retrieval
A user's single query can be insufficient to capture the full scope of information required to answer it comprehensively. The multi-query retrieval strategy addresses this problem by instructing an LLM to generate multiple queries based on the user's initial query, running the retrieval for each query in parallel, and then inserting the retrieved results as prompt context to generate a final model output. The following image shows the process:

![Multi-Query Retrieval](../img/Multi-Query-Retrieval.png)

This strategy is helpful when a single question may rely on multiple perspectives to provide a comprehensive answer. Here is a code example of multi-query retrieval:

In [10]:
from langchain_core.prompts import ChatPromptTemplate

perspectives_prompt = ChatPromptTemplate.from_template("""You are an AI language 
    model assistant. Your task is to generate five different versions of the 
    given user question to retrieve relevant documents from a vector database. 
    By generating multiple perspectives on the user question, your goal is to 
    help the user overcome some of the limitations of the distance-based 
    similarity search. Provide these alternative questions separated by 
    newlines. Original question: {question}""")

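def parse_queries_output(message):
    # one query per line; skip blank lines so we don't search for empty strings
    return [line.strip() for line in message.content.split('\n') if line.strip()]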

query_gen = perspectives_prompt | llm | parse_queries_output

This prompt template is aimed at generating variations of questions based on the user's initial query. We then take the list of generated queries, retrieve the most relevant docs for each of them in parallel, and then combine to get the unique union of all the retrieved relevant documents:

In [11]:
def get_unique_union(document_lists):
    # Flatten list of lists, and dedupe them
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }

    # return a flat list of unique docs
    return list(deduped_docs.values())

retrieval_chain = query_gen | retriever.batch | get_unique_union

Because we're retrieving documents from the same retriever with multiple (related) queries, it's likely that at least some of them are repeated. Before using them as context to answer the question, we need to deduplicate them, to end up with a single instance of each. Here we dedupe docs by using their content (a string) as the key in a dictionary, because a dictionary can only contain one entry for each key. After we've iterated through all docs, we simply get all the dictionary values, which are now free of duplicates.

Notice that we are using **.batch** as well, which runs all generated queries in parallel and returns a list of results: in this case, a list of lists of documents, which we then flatten and dedupe.
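
To see the deduplication at work without a vector store, here is a small sketch that runs the same dictionary-based logic on made-up data. `FakeDocument` is a hypothetical stand-in with only a `page_content` attribute, and the function is repeated so the snippet is self-contained.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def get_unique_union(document_lists):
    # later duplicates overwrite earlier ones under the same key,
    # so each distinct content string survives exactly once
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }
    return list(deduped_docs.values())


# pretend two generated queries both retrieved "chunk B"
results_per_query = [
    [FakeDocument("chunk A"), FakeDocument("chunk B")],
    [FakeDocument("chunk B"), FakeDocument("chunk C")],
]
unique_docs = get_unique_union(results_per_query)
print([doc.page_content for doc in unique_docs])
```

Because Python dictionaries preserve insertion order, the documents come back in the order in which they were first seen.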

The final step is to construct a prompt, including the user's question and combined retrieved relevant documents, and a model interface to generate the prediction.

In [12]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based
                                          on this context:
                                          
                                          {context}
                                          
                                          Question: {question}
                                          """)

@chain
def multi_query_qa(input):
    # fetch relevant documents
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

# run
multi_query_qa.invoke("""What is the eternal voyage poem about?
""")

AIMessage(content='The "Eternal Voyage" poem appears to be a continuation of the themes and motifs introduced in the previous poem, "The Forest of Forgotten Names". It seems to explore the idea of a journey through time, space, and memory.\n\nAt its core, the poem is about a vessel (representing the human soul or spirit) that sails across the "endless seas of time", navigating through the echoes of the past, guided by the light of memories and the whispers of those who have come before. The captain of this vessel is on a quest to chart a course through fate unknown, using only the stars as his guide.\n\nThe poem touches on various themes, including:\n\n1. **Loss and remembrance**: The poem acknowledges that things are lost in time, but also suggests that memories linger, waiting to be rediscovered.\n2. **Journeying through time**: The vessel\'s journey across the seas of time implies a exploration of the past, present, and future.\n3. **The power of memory**: Memories are portrayed as 

This chain isn't too different from previous chains, as all the new logic for multi-query retrieval is contained in **retrieval_chain**. Keeping each technique in a standalone chain (in this case, **retrieval_chain**) is key to making good use of them, because it makes them easy to adopt and even to combine.

### RAG-Fusion
This strategy shares similarities with the multi-query retrieval strategy, except that we apply a final reranking step to all the retrieved documents. This reranking step uses the *reciprocal rank fusion* (RRF) algorithm, which combines the ranks of different search results to produce a single, unified ranking. By combining ranks from different queries, we pull the most relevant documents to the top of the final list. RRF is well suited to combining results from queries that might have different scales or distributions of scores, because it only looks at ranks. Let's see an example:

In [13]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

prompt_rag_fusion = ChatPromptTemplate.from_template("""You are a helpful
    assistant that generates multiple search queries based on a single input
    query. \n
    Generate multiple search queries related to: {question} \n
    Output (4 queries):""")

def parse_queries_output(message):
    # one query per line; skip blank lines so we don't search for empty strings
    return [line.strip() for line in message.content.split('\n') if line.strip()]

llm = ChatOllama(model="llama3.1", temperature=0)

query_gen = prompt_rag_fusion | llm | parse_queries_output

Once we've generated our queries, we fetch relevant documents for each query and pass them into a function to *rerank* (*reorder* according to relevancy) the final list of relevant documents. The reciprocal rank fusion function defined below takes a list of the search results of each query, that is, a list of lists of documents, where each inner list is sorted by relevance to that query. The RRF algorithm then calculates a new score for each document based on its ranks (or positions) in the different lists and sorts them to create a final reranked list.

After calculating the fused scores, the function sorts the documents in descending order of these scores to get the final reranked list, which is then returned.

In [14]:
def reciprocal_rank_fusion(results: list[list], k=60):
    """Reciprocal rank fusion on multiple lists of ranked documents,
    with an optional parameter k used in the RRF formula.
    """
    # Initialize a dictionary to hold fused scores for each document
    # Documents will be keyed by their contents to ensure uniqueness
    fused_scores = {}
    documents = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list,
        # with its rank (zero-based position in the list)
        for rank, doc in enumerate(docs):
            # Use the document contents as the key for uniqueness
            doc_str = doc.page_content
            # If the document hasn't been seen yet,
            # - initialize score to 0
            # - save it for later
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
                documents[doc_str] = doc
            # Update the score of the document using the RRF formula
            # 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents based on their fused scores in descending order
    # to get the final reranked results
    reranked_doc_strs = sorted(
        fused_scores, key=lambda d: fused_scores[d], reverse=True
    )

    # Retrieve the corresponding doc for each doc_str
    return [
        documents[doc_str]
        for doc_str in reranked_doc_strs
    ]

retrieval_chain = query_gen | retriever.batch | reciprocal_rank_fusion

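Notice that the function also takes a *k* parameter, which determines how much influence the documents in each query's result set have over the final list of documents. A *higher value of k* means that lower-ranked documents have more influence, because a larger *k* narrows the gap between the contribution 1 / (rank + k) of a top-ranked document and that of a lower-ranked one.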
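
The effect of *k* is easier to see on a toy example. The sketch below applies the same 1 / (rank + k) scoring to lists of plain string ids instead of documents (an assumption made to keep it self-contained). The rankings are made up, and `rrf_scores` and `rrf_order` are hypothetical helper names.

```python
def rrf_scores(ranked_lists, k=60):
    """Sum 1 / (rank + k) for every id over all rankings (zero-based rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return scores


def rrf_order(ranked_lists, k=60):
    """Return the ids sorted by fused score, best first."""
    scores = rrf_scores(ranked_lists, k)
    return sorted(scores, key=scores.get, reverse=True)


# "y" is only third in each list but appears in both;
# "x" and "z" are each ranked first once
rankings = [
    ["x", "p", "y"],
    ["z", "q", "y"],
]

print(rrf_order(rankings, k=1)[0])   # small k: a single first place wins
print(rrf_order(rankings, k=60)[0])  # large k: repeated lower ranks win
```

With *k*=1 a first place contributes 1.0 while two third places add up to about 0.67, so "x" leads. With *k*=60 a first place contributes 1/60 while two third places add up to 2/62, so "y" moves to the top.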

Finally, we combine our new retrieval chain (now using RRF) with the full chain we've seen before:

In [15]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based
                                          on this context:
                                          
                                          {context}
                                          
                                          Question: {question}
                                          """)
llm = ChatOllama(model="llama3.1", temperature=0, num_predict=100)

@chain
def multi_query_qa(input):
    # fetch relevant documents
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

multi_query_qa.invoke("What is the eternal voyage poem about?")


AIMessage(content='The "Eternal Voyage" poem appears to be a philosophical and metaphorical exploration of the human experience, mortality, and the passage of time. On its surface, it describes a journey across the seas of time, where a vessel sails with silver light, guided by a captain who is lost in whispers and sound.\n\nHowever, upon closer reading, the poem reveals themes of:\n\n1. **The fleeting nature of life**: The poem touches on the idea that everything is transient, including human existence. The', additional_kwargs={}, response_metadata={'model': 'llama3.1', 'created_at': '2025-03-29T18:55:43.5837186Z', 'done': True, 'done_reason': 'length', 'total_duration': 37282589300, 'load_duration': 7778347100, 'prompt_eval_count': 895, 'prompt_eval_duration': 10809000000, 'eval_count': 100, 'eval_duration': 18694000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-3309e9a4-1cdb-4c8c-a76d-85991a3aff35-0', usage_metadata={'input_tokens': 895

RAG-Fusion's strength lies in its ability to capture the user's intended expression, navigate complex queries, and broaden the scope of retrieved documents, enabling serendipitous discovery.

### Hypothetical Document Embeddings
*Hypothetical Document Embeddings (HyDE)* is a strategy that involves creating a hypothetical document based on the user's query, embedding that document, and retrieving relevant documents based on vector similarity. The intuition behind HyDE is that an LLM-generated hypothetical document will be more similar to the most relevant documents than the original query is. Let's see an example:

In [16]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama

prompt_hyde = ChatPromptTemplate.from_template("""Please write a passage to
                                               answer the question. \n Question: {question} \n Passage:""")

generate_doc = (
    prompt_hyde | ChatOllama(model="llama3.1", temperature=0, num_predict=100) | StrOutputParser()
)

Next, we take the hypothetical document and use it as an input to the **retriever**, which will generate its embedding and search for similar documents in the vector store:

In [17]:
retrieval_chain = generate_doc | retriever
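
The intuition can be illustrated with a deliberately crude sketch. Word-count vectors stand in for a real embedding model, and all three strings are made up for the example, so this only shows the mechanism and says nothing about how a real model such as mxbai-embed-large behaves. `toy_embed` and `cosine` are hypothetical helpers.

```python
import math
from collections import Counter


def toy_embed(text):
    """Word-count vector; a crude stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# made-up strings: a stored chunk, a short question, and a passage
# of the kind an LLM might write in answer to that question
stored_chunk = "a vessel sails across endless seas of time guided by a captain"
question = "what is the poem about"
hypothetical_doc = "the poem describes a vessel and a captain crossing the seas of time"

chunk_vec = toy_embed(stored_chunk)
print("question vs chunk:    ", cosine(toy_embed(question), chunk_vec))
print("hypothetical vs chunk:", cosine(toy_embed(hypothetical_doc), chunk_vec))
```

The question shares no words with the stored chunk, while the hypothetical passage shares several, so it lands closer to the chunk. Real embeddings capture meaning rather than exact words, but the same idea applies: an answer-shaped text tends to sit nearer to answer-bearing chunks than a question does.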

Finally, we take the retrieved documents, pass them as context to the final prompt, and instruct the model to generate an output:

In [18]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based on this context:
                                          
                                          {context}
                                          
                                          Question: {question}""")

llm = ChatOllama(model="llama3.1", temperature=0, num_predict=100)

@chain
def qa(input):
    # fetch relevant documents from the hyde retrieval chain defined earlier
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

qa.invoke("What is the eternal voyage poem about?")

AIMessage(content='The "Eternal Voyage" poem appears to be a philosophical and metaphorical exploration of time, fate, memory, and the human experience. On its surface, it describes a journey across the seas of time, where a vessel sails with silver light, guided by a captain who is lost in whispers and sound.\n\nHowever, upon closer reading, the poem reveals themes of:\n\n1. **The passage of time**: The poem acknowledges that echoes fade, stars align, and empires rise and fall, emphasizing', additional_kwargs={}, response_metadata={'model': 'llama3.1', 'created_at': '2025-03-29T19:27:20.4617645Z', 'done': True, 'done_reason': 'length', 'total_duration': 42287971500, 'load_duration': 8060673000, 'prompt_eval_count': 1152, 'prompt_eval_duration': 13889000000, 'eval_count': 100, 'eval_duration': 20336000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-79126f6f-ef28-45b1-a57c-709567794bc3-0', usage_metadata={'input_tokens': 1152, 'output_tokens

At a glance, **query transformation** consists of taking the user's original query and:
1. Rewriting it into one or more queries.
2. Combining the results of those queries into a single set of the most relevant results.

This can take several forms. Usually the process is to take the user's original query and ask the LLM to write a new query or queries. Some examples of typical changes made are:

- Removing irrelevant/unrelated text from the query.
- Grounding the query with past conversation history. 
- Casting a wider net for relevant documents by also fetching documents for related queries.
- Decomposing a complex question into multiple, simpler questions and then including results for all of them in the final prompt to generate an answer.

The right rewriting strategy to use will depend on your use case.

To develop a robust RAG system, we also need to route queries to retrieve relevant data from multiple data sources. For that we use Query Routing. Let's go back to the [README file](../README.md/#).