# Build an Advanced RAG App: Query Rewriting

This notebook gives a step-by-step example for the article "Build an Advanced RAG App: Query Rewriting", available at https://ruxu.dev. The purpose of this example is to showcase several query rewriting techniques in a RAG pipeline.

We will showcase different strategies for rewriting the input query of a RAG pipeline, going from the most basic to some more advanced ones. Then, we will compare the results of a RAG pipeline with and without query rewriting.

First, we will install the required dependencies.

In [None]:
!pip install langchain langchain-community pypdf sentence_transformers faiss-cpu langchain-anthropic


For these examples I'll be using Anthropic's Claude 3.5 Sonnet model.

> In order for it to work, remember to set the Secret Variable "ANTHROPIC_API_KEY" to your own Anthropic API Key, or change the model to any of your choice.

In [None]:
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate, FewShotChatMessagePromptTemplate
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from google.colab import userdata

api_key = userdata.get('ANTHROPIC_API_KEY')
model = ChatAnthropic(model='claude-3-5-sonnet-20240620', api_key=api_key)


embeddings_model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings_model = HuggingFaceBgeEmbeddings(
    model_name=embeddings_model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)


## Zero-shot Query Rewriting
This is the simplest form of query rewriting. Zero-shot refers to the prompt engineering technique of asking the LLM to perform a task without giving it any examples of that task.

In [None]:
system_rewrite = """You are a helpful assistant that generates multiple search queries based on a single input query.

Perform query expansion. If there are multiple common ways of phrasing a user question
or common synonyms for key words in the question, make sure to return multiple versions
of the query with the different phrasings.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Return 3 different versions of the question."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_rewrite),
        ("human", "{question}"),
    ]
)

chain = prompt | model

response = chain.invoke({
    "question": "Which food items does this recipe need?"
})

response

AIMessage(content='Here are 3 expanded versions of the query:\n\n1. What ingredients are required for this recipe?\n\n2. What are the necessary food components for this dish?\n\n3. List the food items needed to make this recipe.', response_metadata={'id': 'msg_01Wvbb5NHKim5RL1o8KChD8B', 'model': 'claude-3-5-sonnet-20240620', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 115, 'output_tokens': 51}}, id='run-4f3c8e43-ca9f-4963-90d9-5101f3425583-0', usage_metadata={'input_tokens': 115, 'output_tokens': 51, 'total_tokens': 166})
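The rewritten queries come back as a single numbered list inside `content`, so they need to be split apart before they can be used for retrieval. The following is a minimal sketch of one way to do that with a regular expression. The helper name `extract_queries` is hypothetical, and it assumes the model keeps answering with a `1.`, `2.`, `3.` list, which is not guaranteed.

```python
import re

def extract_queries(text: str) -> list[str]:
    """Return the items of a numbered list ("1. ...", "2) ...") found in text."""
    # One list item per line: optional indent, a number, "." or ")", then the query.
    pattern = re.compile(r"^[ \t]*\d+[.)][ \t]+(.+?)[ \t]*$", re.MULTILINE)
    return [match.group(1) for match in pattern.finditer(text)]

# Content copied from the response shown above.
content = (
    "Here are 3 expanded versions of the query:\n\n"
    "1. What ingredients are required for this recipe?\n\n"
    "2. What are the necessary food components for this dish?\n\n"
    "3. List the food items needed to make this recipe."
)

rewritten_queries = extract_queries(content)
for rewritten_query in rewritten_queries:
    print(rewritten_query)
```

Parsing free text like this is fragile. Structured output, as used with the Pydantic parser later in this notebook, is usually the more robust option.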

## Few-shot Query Rewriting

For a slightly better result at the cost of using a few more tokens per rewrite, we can give some examples of how we want the rewrite to be done.

In [None]:
examples = [
    {
        "question": "How tall is the Eiffel Tower? It looked so high when I was there last year",
        "answer": "What is the height of the Eiffel Tower?"
    },
    {
        "question": "1 oz is 28 grams, how many cm is 1 inch?",
        "answer": "Convert 1 inch to cm."
    },
    {
        "question": "What's the main point of the article? What did the author try to convey?",
        "answer": "What is the main key point of this article?"
    }
]
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{question}"),
        ("ai", "{answer}"),
    ]
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples
)
final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_rewrite),
        few_shot_prompt,
        ("human", "{question}"),
    ]
)

chain = final_prompt | model
response = chain.invoke({
    "question": "Which food items does this recipe need?"
})
response

AIMessage(content='Here are 3 expanded versions of the query:\n\n1. What ingredients are required for this recipe?\n\n2. List all the food items needed to make this dish.\n\n3. What are the necessary components for preparing this recipe?', response_metadata={'id': 'msg_01TwfyNcp1WunkxnvCFgVVtC', 'model': 'claude-3-5-sonnet-20240620', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 221, 'output_tokens': 52}}, id='run-21b6f70c-4290-4db1-8e3c-e328ff90cca0-0', usage_metadata={'input_tokens': 221, 'output_tokens': 52, 'total_tokens': 273})
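To make it clearer what the model actually receives, here is a rough, library-free sketch of how a few-shot chat prompt is laid out: the system message, then one human/ai pair per example, then the real question. The function `build_few_shot_messages` is a hypothetical illustration of the idea, not LangChain's implementation.

```python
def build_few_shot_messages(system_text, examples, question):
    """Lay out a few-shot chat prompt as a list of (role, content) pairs."""
    messages = [("system", system_text)]
    for example in examples:
        # Each example becomes a fake exchange the model can imitate.
        messages.append(("human", example["question"]))
        messages.append(("ai", example["answer"]))
    # The real question always comes last.
    messages.append(("human", question))
    return messages

demo_examples = [
    {
        "question": "How tall is the Eiffel Tower? It looked so high when I was there last year",
        "answer": "What is the height of the Eiffel Tower?",
    },
    {
        "question": "1 oz is 28 grams, how many cm is 1 inch?",
        "answer": "Convert 1 inch to cm.",
    },
]

messages = build_few_shot_messages(
    "<system prompt>", demo_examples, "Which food items does this recipe need?"
)
for role, content in messages:
    print(f"{role:>6}: {content}")
```

Every example adds two messages to each request, which is where the extra input tokens of the few-shot variant come from.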

## Trainable rewriter

We can fine-tune a pre-trained model to perform the query rewriting task. Instead of relying on examples in the prompt, we can teach it how query rewriting should be done to achieve the best results in context retrieval. We can also train it further using Reinforcement Learning so that it learns to recognize problematic queries and avoid toxic and harmful phrases.

Alternatively, we can use an open-source model that somebody else has already trained for the task of query rewriting.

## Sub-queries

If the user query contains multiple questions, context retrieval becomes tricky. Each question probably needs different information, and we are unlikely to get all of it by using the whole multi-question input as the basis for a single retrieval. To solve this problem, we can decompose the input into multiple sub-queries and perform retrieval for each of them.

In [None]:
system_decompose = """You are a helpful assistant that generates search queries based on a single input query.

Perform query decomposition. Given a user question, break it down into distinct sub questions that
you need to answer in order to answer the original question.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_decompose),
        ("human", "{question}"),
    ]
)

chain = prompt | model

response = chain.invoke({
    "question": """Which is the most popular programming language for machine learning and
is it the most popular programming language overall?"""
})

response

AIMessage(content='To answer this question, we need to break it down into the following sub-questions:\n\n1. What are the most popular programming languages for machine learning?\n2. Which one among these is considered the most popular for machine learning?\n3. What are the most popular programming languages overall?\n4. How does the most popular language for machine learning compare to the most popular language overall?', response_metadata={'id': 'msg_015ua5qkd5APXbAU8APdgys1', 'model': 'claude-3-5-sonnet-20240620', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 101, 'output_tokens': 83}}, id='run-b13038bb-3820-4470-b00b-ea6773deb07c-0', usage_metadata={'input_tokens': 101, 'output_tokens': 83, 'total_tokens': 184})
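Once the input has been decomposed, retrieval runs once per sub-query and the results are merged. The sketch below shows that fan-out step with duplicates removed, since several sub-queries can return the same chunk. Everything here is a toy assumption: `retrieve_for_sub_queries` and `toy_search` are hypothetical names, and the keyword-overlap search only stands in for a real vector store lookup.

```python
def retrieve_for_sub_queries(sub_queries, search, k=1):
    """Run search(sub_query, k) for each sub-query and merge results without duplicates."""
    seen = set()
    merged = []
    for sub_query in sub_queries:
        for chunk in search(sub_query, k):
            if chunk not in seen:  # the same chunk may match several sub-queries
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Toy corpus standing in for the chunks stored in a vector database.
toy_corpus = [
    "Notes on programming languages used for machine learning.",
    "Notes on overall popularity rankings of programming languages.",
    "Notes on vector similarity search.",
]

def toy_search(query, k):
    """Rank chunks by the number of words shared with the query (illustration only)."""
    query_words = set(query.lower().replace("?", "").split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().replace(".", "").split()))

    return sorted(toy_corpus, key=overlap, reverse=True)[:k]

sub_queries = [
    "What are the most popular programming languages for machine learning?",
    "Which one among these is considered the most popular for machine learning?",
    "What are the most popular programming languages overall?",
]

merged_chunks = retrieve_for_sub_queries(sub_queries, toy_search, k=1)
for chunk in merged_chunks:
    print(chunk)
```

The first two sub-queries hit the same chunk, so only two distinct chunks end up in the merged context.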

## Step-back prompt

Many questions can be a bit too complex for the RAG pipeline’s retrieval to grasp the multiple levels of information needed to answer them. For these cases, it can be helpful to generate multiple additional queries to use for retrieval. These queries will be more generic than the original query. This will enable the RAG pipeline to retrieve relevant information on multiple levels.

In [None]:
system_step_back = """You are an expert at taking a specific question and extracting a more generic question that gets at
the underlying principles needed to answer the specific question.

Given a specific user question, write a more generic question that needs to be answered in order to answer the specific question.

If you don't recognize a word or acronym, do not try to rewrite it.

Write concise questions."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_step_back),
        ("human", "{question}"),
    ]
)

chain = prompt | model

response = chain.invoke({
    "question": """Which is the most popular programming language for machine learning?"""
})

response

AIMessage(content='What are the most commonly used programming languages in the field of artificial intelligence and data science?', response_metadata={'id': 'msg_012CxLK8dgbkWMbNm8e5xhBQ', 'model': 'claude-3-5-sonnet-20240620', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 98, 'output_tokens': 21}}, id='run-827d9a2d-2825-4138-bf51-5b055313ade8-0', usage_metadata={'input_tokens': 98, 'output_tokens': 21, 'total_tokens': 119})

## HyDE

Another method to improve how queries are matched with context chunks is Hypothetical Document Embeddings, or HyDE. Sometimes, questions and answers are not that semantically similar, which can cause the RAG pipeline to miss critical context chunks in the retrieval stage. However, even if the query is semantically different from its answer, a response to the query should be semantically similar to another response to the same query. The HyDE method consists of creating hypothetical context chunks that answer the query and using them, instead of the query itself, to match the real context that will help the LLM answer.

In [None]:
actual_document = """
Berkson's paradox, also known as Berkson's bias, collider bias, or Berkson's fallacy, is a result in conditional probability
and statistics which is often found to be counterintuitive, and hence a veridical paradox. It is a complicating factor arising in
statistical tests of proportions. Specifically, it arises when there is an ascertainment bias inherent in a study design. The effect is
related to the explaining away phenomenon in Bayesian networks, and conditioning on a collider in graphical models.

It is often described in the fields of medical statistics or biostatistics, as in the original description of the problem by Joseph Berkson.
"""

actual_document_emb = embeddings_model.embed_documents([actual_document])

In [None]:
system_hyde = """You are an expert at using a question to generate a document useful for answering the question.

Given a question, generate a paragraph of text that answers the question.
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_hyde),
        ("human", "{question}"),
    ]
)

chain = prompt | model

hypothetical_document = chain.invoke({
    "question": """What does Berkson's paradox consist on?"""
})
hypothetical_document

AIMessage(content="Berkson's paradox, also known as Berkson's fallacy or Berkson's bias, is a statistical phenomenon that occurs when there is a misleading perception of a negative correlation between two traits, particularly in a selected population. This paradox was first described by Joseph Berkson in 1946 and is often encountered in medical research and epidemiology. It arises when two independent traits appear to be negatively correlated in a sample population, even though they may be uncorrelated or positively correlated in the general population. This effect is typically observed when there is a selection bias in the sampling process, where individuals are included in the study based on criteria related to both traits being examined. As a result, Berkson's paradox can lead to erroneous conclusions about the relationship between variables, highlighting the importance of careful consideration of selection criteria and potential biases in statistical analyses and research design.",

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

question_embeddings = embeddings_model.embed_documents(["What does Berkson's paradox consist on?"])
hypothetical_document_emb = embeddings_model.embed_documents([hypothetical_document.content])

print(f"Similarity without HyDE: {cosine_similarity(question_embeddings, actual_document_emb)}")
print(f"Similarity with HyDE: {cosine_similarity(hypothetical_document_emb, actual_document_emb)}")

Similarity without HyDE: [[0.86675572]]
Similarity with HyDE: [[0.95113282]]
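The scores above are cosine similarities between embedding vectors. As a small worked illustration of what that number means, here is a pure-Python sketch of the same measure on two toy 2-dimensional vectors (the vectors are made up for the example and are not real embeddings). Because the embeddings model was configured with `normalize_embeddings=True`, its vectors have unit length, in which case cosine similarity reduces to a plain dot product.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

# Toy vectors, chosen so the arithmetic is easy to follow by hand.
vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]

similarity = cosine(vec_a, vec_b)                        # 24 / (5 * 5)
unit_dot = dot(normalize(vec_a), normalize(vec_b))       # same value for unit vectors
print(similarity, unit_dot)
```

Keep in mind that the comparison above is a single question and a single document, so it illustrates the idea behind HyDE rather than measuring its effect.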


## Example: RAG with vs without Query Rewriting

Taking the RAG pipeline from the last article, “How to build a basic RAG app”, we will introduce Query Rewriting into it. We will ask it a slightly more advanced question than last time and observe whether the response improves with Query Rewriting. First, let's build the same RAG pipeline. This time, however, I'll only use the top document returned from the vector database, so that the pipeline is less forgiving of missed documents.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

document_url = "https://arxiv.org/pdf/2312.10997.pdf"
loader = PyPDFLoader(document_url)
pages = loader.load()

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=40,
    length_function=len,
    is_separator_regex=False,
)
chunks = text_splitter.split_documents(pages)
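To give an intuition for what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the function `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share some characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk.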

In [None]:
chunk_texts = list(map(lambda d: d.page_content, chunks))
embeddings = embeddings_model.embed_documents(chunk_texts)

In [None]:
from langchain_community.vectorstores import FAISS

text_embedding_pairs = zip(chunk_texts, embeddings)
db = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)

In [None]:
query = "Which evaluation tools are useful for evaluating a RAG pipeline?"

contexts = db.similarity_search(query, k=1)

print(contexts[0])

page_content='D. Evaluation Benchmarks and Tools\nA series of benchmark tests and tools have been proposed\nto facilitate the evaluation of RAG.These instruments furnish\nquantitative metrics that not only gauge RAG model perfor-\nmance but also enhance comprehension of the model’s capabil-\nities across various evaluation aspects. Prominent benchmarks\nsuch as RGB, RECALL and CRUD [167]–[169] focus on'


In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert at answering questions based on a context extracted from a document. The context extracted from the document is: {context}"),
        ("human", "{question}"),
    ]
)
chain = prompt | model

response = chain.invoke({
    "context": '\n\n'.join(list(map(lambda c: c.page_content, contexts))),
    "question": query
})
response

AIMessage(content="Based on the context provided, there are several evaluation benchmarks and tools that are useful for evaluating RAG (Retrieval-Augmented Generation) pipelines:\n\n1. RGB: This is mentioned as one of the prominent benchmarks for evaluating RAG models.\n\n2. RECALL: This is another benchmark test specifically noted for RAG evaluation.\n\n3. CRUD: This is also listed as a notable benchmark for assessing RAG performance.\n\nThese benchmarks are described as providing quantitative metrics that serve two main purposes:\n\n1. Gauging RAG model performance\n2. Enhancing comprehension of the model's capabilities across various evaluation aspects\n\nThe context indicates that these tools and benchmarks are designed to facilitate the evaluation of RAG by offering a standardized way to measure and understand the performance of RAG models across different dimensions of evaluation.\n\nIt's worth noting that while these are the specific tools mentioned in the given context, there m

The response is good and grounded in the context, but retrieval got caught up in me asking about evaluation and missed that I was specifically asking for tools. As a result, the context used does have information on some benchmarks, but it misses the next chunk of information, which talks about tools.

Now, let's implement the same RAG pipeline, this time with Query Rewriting. In addition to the query rewriting prompts we have already seen in the previous examples, I'll be using a Pydantic parser to extract and iterate over the generated alternative queries.

In [None]:
from langchain_core.pydantic_v1 import BaseModel, Field

class ParaphrasedQuery(BaseModel):
    """You have performed query expansion to generate a paraphrasing of a question."""

    paraphrased_query: str = Field(
        description="A unique paraphrasing of the original question.",
    )

In [None]:
from langchain.output_parsers import PydanticToolsParser

rewrite_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_rewrite),
        ("human", "{question}"),
    ]
)
llm_with_tools = model.bind_tools([ParaphrasedQuery])
query_analyzer = rewrite_prompt | llm_with_tools | PydanticToolsParser(tools=[ParaphrasedQuery])

queries = query_analyzer.invoke({
    "question": query
})
queries

[ParaphrasedQuery(paraphrased_query='What are some effective metrics or tools for assessing the performance of a RAG system?')]

In [None]:
contexts = []
# Use a separate loop variable so the original `query` string is not overwritten.
for rewritten in queries:
    contexts = contexts + db.similarity_search(rewritten.paraphrased_query, k=1)
contexts

[Document(page_content='appraising the essential abilities of RAG models. Concur-\nrently, state-of-the-art automated tools like RAGAS [164],\nARES [165], and TruLens8employ LLMs to adjudicate the\nquality scores. These tools and benchmarks collectively form\na robust framework for the systematic evaluation of RAG\nmodels, as summarized in Table IV.\nVII. D ISCUSSION AND FUTURE PROSPECTS')]

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert at answering questions based on a context extracted from a document. The context extracted from the document is: {context}"),
        ("human", "{question}"),
    ]
)
chain = prompt | model

response = chain.invoke({
    "context": '\n\n'.join(list(map(lambda c: c.page_content, contexts))),
    "question": query
})
response

AIMessage(content="Based on the context provided, there are several effective metrics and tools for assessing the performance of RAG (Retrieval-Augmented Generation) systems:\n\n1. RAGAS: This is mentioned as a state-of-the-art automated tool that uses Large Language Models (LLMs) to evaluate the quality of RAG outputs.\n\n2. ARES: Another advanced automated tool that leverages LLMs for assessing RAG model performance.\n\n3. TruLens: This is also listed as a tool that employs LLMs to judge the quality scores of RAG systems.\n\nThese tools are described as forming part of a robust framework for the systematic evaluation of RAG models. While specific metrics aren't detailed in the given context, it's implied that these tools likely provide various quality scores and performance indicators for RAG systems.\n\nThe context also mentions that these tools and benchmarks are summarized in Table IV, which likely contains more detailed information about specific metrics and evaluation criteria, 

The rewritten query now matches the chunk of information I wanted my answer to come from, giving the LLM a better chance of producing a much better response to my question.