# Ref: https://github.com/sakunaharinda/ragatouille-book/blob/main/book/2_Query_Transformation.ipynb

# Query Transformation

The main idea behind query transformation is to translate or transform the user query so that the LLM can answer the question correctly. For instance, if the user asks an ambiguous question, the RAG retriever might return documents that are incorrect, ambiguous, or only weakly relevant, because their embeddings merely happen to lie close to the query. This in turn can lead the LLM to hallucinate answers. There are a few ways to tackle this problem, including:

- [Step-back prompting](https://arxiv.org/pdf/2310.06117): This involves encouraging the LLM to take a step back from a given question or problem and pose a more abstract, higher-level question that encompasses the essence of the original inquiry.
- [Least-to-most prompting](https://arxiv.org/pdf/2205.10625): This breaks a complex problem down into a series of simpler subproblems and then solves them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems.
- Query re-writing ([Multi-Query](https://medium.com/@kbdhunga/advanced-rag-multi-query-retriever-approach-ad8cd0ea0f5b) or [RAG Fusion](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1)): This generates multiple questions from the original question with different wording and perspectives, then retrieves documents for each generated question by similarity search against the vector store and uses them to answer the original question.

A LangChain blog post about query transformation can be found [here](https://blog.langchain.dev/query-transformations/).

Now, let's try to implement the above techniques using LangChain!


In [None]:
from dotenv import load_dotenv
load_dotenv()
import rich

Similar to the Introduction notebook, we first import the libraries, load documents, split them, generate embeddings, store them in a vector store and create the retriever using the vector store.

In [None]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.load import loads, dumps
from typing import List

In [None]:
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# embedding = OpenAIEmbeddings()  # alternative: OpenAI-hosted embeddings

In [None]:
loader = DirectoryLoader('../../pdf_files/', glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)

vectorstore = Chroma.from_documents(documents=text_chunks,
                                    embedding=embedding,
                                    persist_directory="data/vectorstore")
vectorstore.persist()

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={'k':5})
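
To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a minimal pure-Python sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to split on natural separators, which this sketch ignores. The helper name `chunk_text` is hypothetical and not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Naive fixed-width chunking with overlap (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts `chunk_size - chunk_overlap` characters after the previous one,
    # so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = chunk_text("abcdefghij" * 100)  # a 1000-character toy document
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.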

## Query Translation

### Multi-Query

In the multi-query approach, we first use an LLM (here, `gpt-4o-mini`) to generate 5 different questions based on our original question. To do that, we create a prompt with `ChatPromptTemplate`. Then we build a chain using LCEL that reads the user input and assigns it to the `question` placeholder of the prompt, sends the prompt to the LLM, and parses the output, which contains 5 questions separated by newline characters, into a list.

In [None]:
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """
    You are an intelligent assistant. Your task is to generate 5 questions based on the provided question in different wording and different perspectives to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newline `\n`.
    
    Original question: {question}
    """
)

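generate_queries = (
    {"question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0.7)
    | StrOutputParser()
    # The prompt asks for newline-separated questions, so split on single
    # newlines and drop any blank lines the model adds between them.
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)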

We can check whether or not our query generation works by invoking the created chain with a query.

In [None]:
generate_queries.invoke("What need to consider when using LLM to eval LLM generation?")
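
LLMs do not always follow the requested output format exactly: they may number the questions, use bullets, or insert blank lines. The following sketch shows one way the parsing step could be made more tolerant. The helper `parse_queries` and the sample response are illustrative assumptions, not part of the original notebook or of LangChain.

```python
import re


def parse_queries(llm_output: str) -> list[str]:
    """Split an LLM response into one query per non-empty line (hypothetical helper)."""
    queries = []
    for line in llm_output.splitlines():
        # Drop leading list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


# Made-up response mixing numbering, a blank line, and a bullet.
sample = (
    "1. How are LLM judges validated?\n\n"
    "2. What biases affect LLM-based evaluation?\n"
    "- Which metrics complement an LLM judge?"
)
parsed = parse_queries(sample)
print(parsed)
```

Such a function could replace the final lambda in the chain, since LCEL accepts plain callables as chain steps.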

Once we get the 5 questions, we retrieve the 5 most relevant documents for each question in parallel (resulting in a list of lists) and build a new document list from the unique documents in the union of all retrieved documents. To do that, we create another chain, `retrieval_chain`, using LCEL.

In [None]:
def get_context_union(docs: List[List]):
    all_docs = [dumps(d) for doc in docs for d in doc]
    unique_docs = list(set(all_docs))
    
    return [loads(doc).page_content for doc in unique_docs] # We only return page contents


retrieval_chain = (
    # generate_queries already maps the raw input string to `question`,
    # so the input is not wrapped in a dict a second time here
    generate_queries
    | retriever.map()
    | get_context_union
)
    

In [None]:
retrieval_chain.invoke("What need to consider when using LLM to eval LLM generation?")
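
One detail worth noting about the union step: a Python `set` does not preserve insertion order, so the order in which documents were retrieved is lost. If that order matters, an order-preserving de-duplication is a small change. The sketch below works on plain strings standing in for page contents; the helper name `unique_in_order` is hypothetical.

```python
def unique_in_order(ranked_lists: list[list[str]]) -> list[str]:
    """Flatten several retrieval results and keep the first occurrence of each item."""
    # dict keys keep insertion order (Python 3.7+), unlike set elements.
    seen = dict.fromkeys(text for ranked in ranked_lists for text in ranked)
    return list(seen)


# Three toy result lists, one per generated question.
results = [["a", "b"], ["b", "c"], ["a", "d"]]
merged = unique_in_order(results)
print(merged)
```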

Finally, we put everything together in one final chain that reads the user query, gets the context from the unique documents returned by the `retrieval_chain`, adds both the question and the context to the prompt, sends it through the LLM, and produces the final formatted output using the `StrOutputParser`.

In [None]:
prompt = ChatPromptTemplate.from_template(
    """
    Answer the given question using the provided context.\n\nContext: {context}\n\nQuestion: {question}
    """
)

multi_query_chain = (
    {'context': retrieval_chain, 'question': RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0)
    | StrOutputParser()
)

In [None]:
multi_query_chain.invoke("What need to consider when using LLM to eval LLM generation?")

After executing all the above cells, you will be able to see a LangSmith trace like [this](https://smith.langchain.com/public/31d1e43a-3727-4d0b-82fb-2bbdf146dfac/r).

### RAG Fusion

In the default multi-query approach, after retrieving the relevant documents for each generated question, we take the union of all the documents and keep only the unique ones (the same document can be retrieved by multiple questions). However, we paid no attention to the rank of each document, which matters for the LLM to produce the most accurate answer: when many documents are retrieved but the LLM's context window is limited, the individual ranks help us decide which top-k documents to keep as context. Therefore, in RAG Fusion, we do exactly the same thing up to retrieving the documents, but then use [Reciprocal Rank Fusion (RRF)](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking) to rank the retrieved documents before using them as the context to answer our original question.
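
As a reference for the ranking step, RRF is commonly written as the score below, where $R$ is the set of ranked lists (one per generated question), $r(d)$ is the position of document $d$ in list $r$, and $k$ is a smoothing constant (60 in this notebook's `rrf` function). The sum runs only over the lists that contain $d$. One assumption to keep in mind: this notebook's implementation counts positions from 0, whereas many descriptions count from 1.

$$
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
$$

For instance, with $k = 60$ and zero-based positions, a document found at position 0 in one list and position 2 in another scores $1/60 + 1/62 \approx 0.0328$, while a document found only once at position 0 scores $1/60 \approx 0.0167$.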

In [None]:
def rrf(results: List[List], k=60):
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0

            # Retrieve the current score of the document, if any
            previous_score = fused_scores[doc_str]

            # Update the score of the document using the RRF formula: 1 / (rank + k)
            # (rank is zero-based here because it comes from enumerate)
            fused_scores[doc_str] += 1 / (rank + k)

            print(f"Updating score for {doc} from {previous_score} to {fused_scores[doc_str]} based on rank {rank}")

    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(ranked_doc), score)
        for ranked_doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results
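
To see how the fusion behaves without a vector store, here is a small self-contained run of the same scoring rule on plain string IDs instead of serialized documents. The function name `rrf_ids` and the toy rankings are assumptions made for illustration only.

```python
from collections import defaultdict


def rrf_ids(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over lists of document IDs (zero-based ranks, as in `rrf`)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# One ranked list per generated question.
rankings = [["a", "b", "c"], ["b", "a", "d"], ["c", "b", "e"]]
fused = rrf_ids(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Document `b` never tops more than one list, yet it ranks first after fusion because it appears near the top of all three lists. This is the behaviour RRF is meant to reward.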

The only difference between the code below and the multi-query code we went through earlier is that we now use our `rrf` function instead of `get_context_union` to produce the final list of documents related to our original question (i.e., the context).

In [None]:
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """
    You are an intelligent assistant. Your task is to generate 4 questions based on the provided question in different wording and different perspectives to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question: {question}
    """
)

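generate_queries = (
    {"question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0.7)
    | StrOutputParser()
    # Split on newlines and drop blank lines so no empty query reaches the retriever
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)


fusion_retrieval_chain = (
    # generate_queries already maps the raw input string to `question`
    generate_queries
    | retriever.map()
    | rrf
)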


In [None]:
result = fusion_retrieval_chain.invoke("What need to consider when using LLM to eval LLM generation?")

In [None]:
rich.print(result)

Here we format the context by considering only the page contents, without metadata or re-ranking scores.

In [None]:
def format_context(documents: List):
    return "\n\n".join([doc[0].page_content for doc in documents])


prompt = ChatPromptTemplate.from_template(
    """
    Answer the given question using the provided context.\n\nContext: {context}\n\nQuestion: {question}
    """
)

multi_query_chain = (
    {'context': fusion_retrieval_chain | format_context, 'question': RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0)
    | StrOutputParser()
)

In [None]:
multi_query_chain.invoke("What need to consider when using LLM to eval LLM generation?")

After executing all the above cells, you will be able to see a LangSmith trace like [this](https://smith.langchain.com/public/99c5fb68-0ccf-4508-a72d-7c3a7b5e61d2/r).
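
Since the motivation for ranking is a limited context window, a natural extension is to keep only the best-ranked documents when building the context. The sketch below assumes the fused results have already been reduced to `(text, score)` tuples sorted by score; the helper `format_top_k` and its `max_chars` budget are hypothetical and not part of the notebook.

```python
def format_top_k(ranked: list[tuple[str, float]], top_k: int = 3, max_chars: int = 2000) -> str:
    """Join the highest-ranked texts, stopping at top_k items or the character budget."""
    parts: list[str] = []
    used = 0
    for text, _score in ranked[:top_k]:
        if used + len(text) > max_chars:
            break  # adding this text would exceed the budget
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)


# Toy fused results, already sorted by descending score.
ranked = [("aaaa", 0.05), ("bbbb", 0.03), ("cccc", 0.02), ("dddd", 0.01)]
print(format_top_k(ranked, top_k=2))
print(format_top_k(ranked, top_k=3, max_chars=10))
```

A character budget is only a rough proxy for a token budget; a tokenizer-based count would be more precise.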

## Query Decomposition

In "Query Translation", we focused on generating multiple questions from our original question with different perspectives (i.e., translating the query) to improve RAG.  
However, because this is in fact translation, the generated questions all have the same meaning even though their wording differs. Therefore, the answers to all the questions are somewhat similar. As a result, while the multi-query approach helps avoid ambiguities in the user query by writing it in different ways, **it will not help when the user query is complex** (e.g., a long mathematical computation).

As a solution, we can break down (i.e., decompose) the original query into multiple sub-problems (as in recursion or dynamic programming) and answer each sub-problem sequentially or in parallel to derive the answer to our original query. This simplifies the prompts and increases the context for the retrieval process. We do that using "Query Decomposition".

### Least-to-Most Prompting

First let's look at how to implement [Least-to-Most Prompting](https://arxiv.org/pdf/2205.10625) to break down a complex query into subquestions and answer them recursively to derive the final answer. 

Similar to multi-query and RAG Fusion, we first have to generate a few questions based on our original question. However, our prompt should be different, as we are generating sub-questions by decomposing the original one instead of generating the same question from different perspectives.

In [None]:
from langchain.prompts import ChatPromptTemplate

decomposition_prompt = ChatPromptTemplate.from_template(
    """
    You are a helpful assistant that can break down complex questions into simpler parts. \n
        Your goal is to decompose the given question into multiple sub-questions that can be answered in isolation to answer the main question in the end. \n
        Provide these sub-questions separated by one newline character. \n
        Original question: {question}\n
        Output (3 queries): 
    """
)

query_generation_chain = (
    {"question": RunnablePassthrough()}
    | decomposition_prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0.7)
    | StrOutputParser()
    # The prompt asks for one newline between sub-questions, so split on "\n" and drop blank lines
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)


In [None]:
questions = query_generation_chain.invoke("What need to consider when using LLM to eval LLM generation?")
questions

After generating the sub-questions, we iterate through them and answer each one individually using the `least_to_most_chain`. The chain extracts the `question` from the input using `itemgetter` and passes it to our `retriever` to retrieve related documents as the `context`; `q_a_pairs` is also provided as part of the input. We then populate the prompt and send it to the LLM to get the answer. After each step we store the sub-question `Q_{n-1}` and its answer `A_{n-1}`, since they are provided as background when answering the next question `Q_{n}`.

In [None]:
from operator import itemgetter


# Create the final prompt template to answer the question with provided context and background Q&A pairs
template = """Here is the question you need to answer:
\n --- \n {question} \n --- \n

Here is any available background question + answer pairs:
\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question: 
\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question: \n {question}
"""

least_to_most_prompt = ChatPromptTemplate.from_template(template) 
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

least_to_most_chain = (
        {'context': itemgetter('question') | retriever,
        'q_a_pairs': itemgetter('q_a_pairs'),
        'question': itemgetter('question'),
        }
        | least_to_most_prompt
        | llm
        | StrOutputParser()
    )

q_a_pairs = ""
for q in questions:
    answer = least_to_most_chain.invoke({"question": q, "q_a_pairs": q_a_pairs})
    q_a_pairs+=f"Question: {q}\n\nAnswer: {answer}\n\n"
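
The accumulation pattern in this loop can be illustrated without any LLM call. In the sketch below, `fake_answer` is a stand-in for the chain (an assumption for illustration): it simply reports how many earlier question and answer pairs it was given, which shows that each sub-question sees all the answers produced before it.

```python
from typing import Callable


def least_to_most(sub_questions: list[str], answer_fn: Callable[[str, str], str]) -> str:
    """Answer sub-questions in order, feeding earlier Q&A pairs into each later call."""
    q_a_pairs = ""
    for question in sub_questions:
        answer = answer_fn(question, q_a_pairs)
        q_a_pairs += f"Question: {question}\n\nAnswer: {answer}\n\n"
    return q_a_pairs


def fake_answer(question: str, background: str) -> str:
    # Stand-in for the LLM call: reports how many earlier Q&A pairs it could see.
    return f"answered with {background.count('Question:')} earlier pair(s)"


history = least_to_most(["Q1?", "Q2?", "Q3?"], fake_answer)
print(history)
```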

After getting answers for the 3 generated sub-questions, finally we answer our original question by invoking the `least_to_most_chain` once again, but this time with the original question and all `q_a_pairs`.

In [None]:
result = least_to_most_chain.invoke({"question": "What need to consider when using LLM to eval LLM generation?", "q_a_pairs": q_a_pairs})

In [None]:
rich.print(result)

The LangSmith trace for the original question answer will look like [this](https://smith.langchain.com/public/7bd7f987-a53a-4d32-abb0-823940bc3f27/r).

Instead of answering the sub-questions sequentially, we can have the LLM answer each of them independently (and therefore, if desired, in parallel) and then use those answers to derive the final answer to our main question.

In [None]:
prompt = hub.pull('rlm/rag-prompt')
rich.print(prompt)

In [None]:
def generate_and_answer(question):
    
    questions = []
    
    sub_questions = query_generation_chain.invoke(question)
    
    sub_qa_chain = (
        {'context': RunnablePassthrough() | retriever, 'question': RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model='gpt-4o-mini', temperature=0)
        | StrOutputParser()
    )
    
    for q in sub_questions:
        answer = sub_qa_chain.invoke(q)
        questions.append({"question": q, "answer": answer})
        
    return questions
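
Because the sub-questions here do not depend on each other, the loop could also run concurrently. The following sketch uses a thread pool from the standard library with a stand-in answer function; `answer_independently` is a hypothetical helper, and `str.upper` merely replaces the real chain call so that the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def answer_independently(
    sub_questions: list[str],
    answer_fn: Callable[[str], str],
    max_workers: int = 4,
) -> list[dict]:
    """Answer each sub-question in its own thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not completion order.
        answers = list(pool.map(answer_fn, sub_questions))
    return [{"question": q, "answer": a} for q, a in zip(sub_questions, answers)]


pairs = answer_independently(["what is a?", "what is b?"], str.upper)
print(pairs)
```

LangChain runnables also provide a `batch` method that could serve the same purpose, for example `sub_qa_chain.batch(sub_questions)`.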

In [None]:
result = qa_pairs = generate_and_answer("What need to consider when using LLM to eval LLM generation?")

In [None]:
rich.print(result)

In [None]:
def format_qa_pairs(qa_pairs):
    
    formatted_string = ""
    
    for i, qa in enumerate(qa_pairs, start=1):  # number the pairs from 1 rather than 0
        formatted_string += f"Question {i}: {qa['question']}\nAnswer {i}: {qa['answer']}\n\n"
    return formatted_string.strip()

context = format_qa_pairs(qa_pairs)

# Prompt

prompt = ChatPromptTemplate.from_template(
    """
    Consider the following Question and Answer Pairs:

    {context}

    Use these to synthesize an answer to the question: {question}
    """
)

final_rag_chain = (
     prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0)
    | StrOutputParser()
)

In [None]:
result = final_rag_chain.invoke({'context': context, 'question': "What need to consider when using LLM to eval LLM generation?"})

In [None]:
rich.print(result)

The LangSmith trace for answering the original question will look like [this](https://smith.langchain.com/public/d5a17200-7752-42cb-87b9-146959e691bc/r).

### Step-Back Prompting

[Step back prompting](https://arxiv.org/pdf/2310.06117) allows LLMs to step back through in-context learning – prompting them to derive high-level abstractions such as concepts and principles for a specific example (i.e., Abstraction). Then, grounded on the documents regarding the high-level concept or principle, the LLM can reason about the solution to the original question (i.e., Reasoning).

For example, if the original question is "What happens to the pressure, P, of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?", a possible step-back question would be "What are the physics principles behind this question?". The context (i.e., documents) retrieved for the step-back question is then used as additional context to answer the original question.

To generate such step-back questions, we use few-shot learning to provide a few examples of (question, step-back question) pairs to the LLM.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

examples = [
    {
        'input': 'What happens to the pressure, P, of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?',
        'output': 'What are the physics principles behind this question?'
    },
    {
        'input': 'Estella Leopold went to which school between Aug 1954 and Nov 1954?',
        'output': "What was Estella Leopold's education history?"
    }
]
example_prompt = ChatPromptTemplate.from_messages(
            [
                ('human', '{input}'), ('ai', '{output}')
            ]
        )
few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=examples,
            # This is a prompt template used to format each individual example.
    example_prompt=example_prompt,
)

final_prompt = ChatPromptTemplate.from_messages(
            [
                ('system', """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:"""),
                few_shot_prompt,
                ('user', '{question}'),
            ]
        )

In [None]:
rich.print(final_prompt.format(question= "What need to consider when using LLM to eval LLM generation?"))
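
Conceptually, the few-shot prompt expands into a flat list of chat messages: the system instruction, then one human/AI pair per example, then the actual question. The pure-Python sketch below mirrors that structure; it is not LangChain's implementation, and the helper name `build_few_shot_messages` is hypothetical.

```python
def build_few_shot_messages(
    system: str, examples: list[dict], question: str
) -> list[tuple[str, str]]:
    """Expand (input, output) examples into alternating human/ai messages."""
    messages = [("system", system)]
    for example in examples:
        messages.append(("human", example["input"]))
        messages.append(("ai", example["output"]))
    # The real question comes last, so the model continues the demonstrated pattern.
    messages.append(("human", question))
    return messages


demo_examples = [
    {"input": "Estella Leopold went to which school between Aug 1954 and Nov 1954?",
     "output": "What was Estella Leopold's education history?"},
]
demo_messages = build_few_shot_messages(
    "Paraphrase the question into a more generic step-back question.",
    demo_examples,
    "What need to consider when using LLM to eval LLM generation?",
)
for role, content in demo_messages:
    print(f"{role}: {content}")
```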

Then we use the created few-shot prompt to generate the step-back question through a chain.

#### Chain type 1: the chain wraps a plain string input into `question`

In [None]:
step_back_query_chain = (
    {'question': RunnablePassthrough()}
    | final_prompt 
    | ChatOpenAI(model='gpt-4o-mini', temperature=0.7) 
    | StrOutputParser()
    )

step_back_query_chain.invoke("What need to consider when using LLM to eval LLM generation?")



#### Chain type 2: the caller passes a dict with the `question` key

In [None]:
step_back_query_chain = (
    final_prompt 
    | ChatOpenAI(model='gpt-4o-mini', temperature=0.9) 
    | StrOutputParser()
    )

step_back_query_chain.invoke({"question": "What need to consider when using LLM to eval LLM generation?"})



Finally, we use both the context retrieved for the original question and the context retrieved for the step-back question to answer our original question via the `step_back_chain`.  

In [None]:
response_prompt_template = """You are an expert of world knowledge. 
I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. 
Otherwise, ignore them if they are not relevant.

<normal_context>
# {normal_context}
</normal_context>

<step_back_context>
# {step_back_context}
</step_back_context>


# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)

step_back_chain = (
    {'normal_context': RunnablePassthrough() | retriever,
     'step_back_context': RunnablePassthrough() | step_back_query_chain | retriever,
     'question': RunnablePassthrough()
     }
    | response_prompt
    | ChatOpenAI(model='gpt-4o-mini', temperature=0)
    | StrOutputParser()
)

In [None]:
step_back_chain.invoke("What need to consider when using LLM to eval LLM generation?")
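
The data flow of this chain, two retrievals feeding one prompt, can be sketched with stand-ins for the LLM and the vector store. Everything below is an illustrative assumption: `step_back_contexts` is a hypothetical helper, the "retriever" is a keyword lookup over a two-entry toy corpus, and the step-back generator returns a fixed question.

```python
from typing import Callable


def step_back_contexts(
    question: str,
    make_step_back: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> dict:
    """Collect the prompt inputs: context for the question and for its step-back version."""
    step_back_question = make_step_back(question)
    return {
        "normal_context": retrieve(question),
        "step_back_context": retrieve(step_back_question),
        "question": question,
    }


# Toy corpus and keyword "retriever" standing in for the vector store.
corpus = {
    "pressure": "Doc about pressure changes.",
    "principles": "Doc about the ideal gas law.",
}


def toy_retrieve(query: str) -> list[str]:
    return [text for keyword, text in corpus.items() if keyword in query.lower()]


inputs = step_back_contexts(
    "What happens to the pressure of an ideal gas if the temperature doubles?",
    lambda q: "What are the physics principles behind this question?",
    toy_retrieve,
)
print(inputs)
```

The step-back question pulls in a document about the general principle that the specific question alone would not have retrieved.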

In [None]:
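# Debugging variant: stop at the prompt (no retriever, no final LLM call) to inspect the filled-in template
test_step_back_chain = (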
    {'normal_context': RunnablePassthrough(),
     'step_back_context': RunnablePassthrough() | step_back_query_chain,
     'question': RunnablePassthrough()
     }
    | response_prompt)

res = test_step_back_chain.invoke("What need to consider when using LLM to eval LLM generation?")
rich.print(res)

The LangSmith trace for the implemented step-back prompting chain will look like [this](https://smith.langchain.com/public/425c098b-47ae-4f53-9259-8cd6b567a2b0/r).

In this notebook, we looked at ways to improve the LLM's answers to a user query through query transformation. In summary, query transformation can help remove ambiguities from the user query and simplify it through techniques such as:

- **Multi-query**: Rewrites the question from different perspectives (i.e., as alternative phrasings of the same question).
- **RAG Fusion**: Not only rewrites the question from different perspectives, but also ranks the documents retrieved for each rewritten question to provide the most relevant information for answering the original question.
- **Least-to-Most Prompting**: Breaks complex questions down into multiple sub-problems and answers the final question using the sub-problems and their answers as context.
- **Step-back Prompting**: Generates a step-back question and uses the documents retrieved for that step-back question as additional context to answer the original question.

In the next section, we will generate hypothetical documents instead of questions, to help LLMs answer questions more accurately through [HyDE](https://arxiv.org/pdf/2212.10496) (Hypothetical Document Embeddings).