# Query Translation

The main idea behind query translation is to rewrite the user query so that the LLM can answer the question correctly. For instance, if the user asks an ambiguous question, our RAG retriever might return documents whose embeddings are close to the query but that are not relevant to the question, leading the LLM to hallucinate an answer. There are a few ways to tackle this problem, including:

- [Step-back prompting](https://arxiv.org/pdf/2310.06117): This involves encouraging the LLM to take a step back from a given question or problem and pose a more abstract, higher-level question that encompasses the essence of the original inquiry.
- [Least-to-most prompting](https://arxiv.org/pdf/2205.10625): This breaks a complex problem down into a series of simpler subproblems and then solves them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems.
- Query re-writing ([Multi-Query](https://medium.com/@kbdhunga/advanced-rag-multi-query-retriever-approach-ad8cd0ea0f5b) or [RAG Fusion](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1)): This generates multiple questions from the original question with different wording and perspectives. We then retrieve documents for each generated question using its similarity to the entries in the vector store, and use the retrieved documents to answer the original question.
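To make the least-to-most idea more concrete, here is a minimal sketch of the sequential solving step in plain Python. It assumes the subquestions have already been produced (in the paper the decomposition itself is also done by the LLM), and both `least_to_most` and `answer_fn` are hypothetical names used only for illustration; `answer_fn` stands in for any call that sends a prompt to an LLM and returns its answer.

```python
from typing import Callable, List, Tuple


def least_to_most(
    subquestions: List[str],
    answer_fn: Callable[[str], str],  # hypothetical stand-in for an LLM call
) -> List[Tuple[str, str]]:
    """Answer subquestions in order, feeding earlier Q&A pairs into later prompts."""
    solved: List[Tuple[str, str]] = []
    for sub in subquestions:
        # Previously solved subproblems become context for the next one
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved)
        prompt = f"{history}\nQ: {sub}\nA:" if history else f"Q: {sub}\nA:"
        solved.append((sub, answer_fn(prompt)))
    return solved
```

The last entry of the returned list holds the answer to the final (hardest) subquestion, which was produced with all earlier answers in its prompt.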

Now, let's try to implement the above techniques using LangChain!


In [1]:
%load_ext dotenv
%dotenv secrets/secrets.env

Similar to the Introduction notebook, we first import the libraries, load documents, split them, generate embeddings, store them in a vector store and create the retriever using the vector store.

In [9]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from typing import List

In [3]:
loader = DirectoryLoader('data/', glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)

vectorstore = Chroma.from_documents(documents=text_chunks, 
                                    embedding=OpenAIEmbeddings(),
                                    persist_directory="data/vectorstore")
vectorstore.persist()


In [4]:
retriever = vectorstore.as_retriever(search_kwargs={'k':5})

### Multi-Query

In the multi-query approach, we first use an LLM (here, GPT-4) to generate 5 different questions based on our original question. To do that, we create a prompt with `ChatPromptTemplate`. We then build a chain with LCEL that reads the user input, assigns it to the `question` placeholder of the prompt, sends the prompt to the LLM, and parses the output into a list of 5 questions by splitting on newline characters.

In [27]:
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """
    You are an intelligent assistant. Your task is to generate 5 questions based on the provided question in different wording and different perspectives to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question: {question}
    """
)

generate_queries = (
    {"question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4', temperature=0.7)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

We can check whether or not our query generation works by invoking the created chain with a query.

In [28]:
generate_queries.invoke("What is QLoRA?")

['Can you provide information on QLoRA?',
 'What does QLoRA stand for?',
 'Can you explain the concept of QLoRA?',
 'Could you elaborate on what QLoRA is?',
 'What are the details about QLoRA?']
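The output above is clean, but an LLM may also return blank lines or numbered items, which a plain `split("\n")` would pass on to the retriever as queries. The following is a minimal sketch of a more defensive parser; `parse_queries` and its `max_queries` parameter are hypothetical names introduced here, not part of LangChain.

```python
import re
from typing import List


def parse_queries(raw: str, max_queries: int = 5) -> List[str]:
    """Split an LLM response into clean, non-empty query strings."""
    queries = []
    for line in raw.splitlines():
        # Drop leading list markers such as "1.", "2)", "-" or "*"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries[:max_queries]
```

Since LCEL accepts plain functions as chain steps, this helper could be used in place of the `lambda` in `generate_queries` if the model's formatting turns out to be inconsistent.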

Once we have the 5 questions, we retrieve the 5 most relevant documents for each question in parallel (resulting in a list of lists) and build a new document list by taking the unique documents from the union of all retrieved documents. To do that, we create another chain, `retrieval_chain`, using LCEL.

In [29]:
from langchain.load import loads, dumps

def get_context_union(docs: List[List]):
    all_docs = [dumps(d) for doc in docs for d in doc]
    unique_docs = list(set(all_docs))
    
    return [loads(doc).page_content for doc in unique_docs] # We only return page contents


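retrieval_chain = (
    generate_queries      # already maps the raw input to {"question": ...}
    | retriever.map()     # run the retriever once per generated query
    | get_context_union   # flatten and de-duplicate the retrieved documents
)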

In [30]:
retrieval_chain.invoke("What is QLoRA?")

['trade-off exactly lies for QLoRA tuning, which we leave to future work to explore.\nWe proceed to investigate instruction tuning at scales that would be impossible to explore with full\n16-bit finetuning on academic research hardware.\n5 Pushing the Chatbot State-of-the-art with QLoRA\nHaving established that 4-bit QLORAmatches 16-bit performance across scales, tasks, and datasets\nwe conduct an in-depth study of instruction finetuning up to the largest open-source language models',
 'investigations of the tradeoffs of simple cross-entropy loss and RLHF training. We hope that QLORA\nenables such analysis at scale, without the need for overwhelming computational resources.\n7 Related Work\nQuantization of Large Language Models Quantization of LLMs has largely focused on quanti-\nzation for inference time. Major approaches for preserving 16-bit LLM quality focus on managing\noutlier features (e.g., SmoothQuant [ 66] and LLM.int8() [ 14]) while others use more sophisticated',
 'Quantiza
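Note that a Python `set` does not guarantee any ordering, so the order of the contexts returned above can differ between runs. If a stable order is preferred, one option is to de-duplicate while keeping the first occurrence of each item. The sketch below is a simplification that works directly on page-content strings (whereas the function above de-duplicates on the serialized documents, including metadata); `unique_in_order` is a hypothetical helper name.

```python
from typing import Iterable, List


def unique_in_order(doc_lists: Iterable[Iterable[str]]) -> List[str]:
    """Flatten a list of lists and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for docs in doc_lists:
        for text in docs:
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique
```

With this approach, chunks retrieved for the first generated question appear first in the final context.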

Finally, we put everything together in one final chain that reads the user query, gets the context (the unique documents retrieved for the 5 generated questions) using the `retrieval_chain`, adds both the question and the context to the prompt, sends it through the LLM, and returns the final output as a string using the `StrOutputParser`.

In [31]:
prompt = ChatPromptTemplate.from_template(
    """
    Answer the given question using the provided context.\n\nContext: {context}\n\nQuestion: {question}
    """
)

multi_query_chain = (
    {'context': retrieval_chain, 'question': RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model='gpt-4', temperature=0)
    | StrOutputParser()
)

In [32]:
multi_query_chain.invoke("What is QLoRA?")

'QLoRA is an efficient adaptation strategy for language models that allows for quick task-switching when deployed as a service by sharing the majority of the model parameters. It does not introduce inference latency nor reduces input sequence length while retaining high model quality. QLoRA introduces a number of innovations to save memory without sacrificing performance, such as 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights. It enables the finetuning of large language models on phones and other low resource settings, making the finetuning of high quality LLMs much more widely and easily accessible.'
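In the chain above, `retrieval_chain` returns a list of strings, so the `{context}` placeholder is filled with the string representation of that list (brackets, quotes and escaped newlines included). This works, but as an optional refinement we could join the chunks into plain text first. The following is a minimal sketch; `format_context` and the separator are assumptions made for illustration.

```python
from typing import List


def format_context(chunks: List[str], separator: str = "\n\n---\n\n") -> str:
    """Join retrieved chunks into one block of text for the prompt."""
    # Skip empty chunks and trim surrounding whitespace
    return separator.join(chunk.strip() for chunk in chunks if chunk.strip())
```

Assuming LCEL's usual coercion of plain functions, this could be wired in as `'context': retrieval_chain | format_context`.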

After executing all the above cells, you will be able to see a LangSmith trace like [this](https://smith.langchain.com/public/f38c02d1-23a5-4961-a076-3ff20a872d45/r).

### RAG Fusion